with reinforcement learning and control theory.

The developed algorithms form the basis of various applications such as:
(i) Vision processing
(ii) Language processing
(iii) Forecasting (e.g., stock market trends)
(iv) Pattern recognition
(v) Games
(vi) Data mining
(vii) Expert systems
(viii) Robotics.


Q.2. Write the applications of machine learning.

Ans. The applications of machine learning are as follows:
(i) Medical diagnosis
(ii) Machine perception
(iii) Game playing
(iv) Information retrieval
(v) Brain machine interfaces
(vi) Affective computing
(vii) Natural language processing
(viii) Recommender systems
(xi) Speech and handwriting recognition
(xii) Stock market analysis
(xiii) Structural health monitoring
(xiv) Syntactic pattern recognition
(xv) Computer vision, including object recognition.

Q.3. What are machine learning tools ? Explain.

Ans. Machine learning provides a set of tools that use computers to transform data into actionable information. Tools are a big part of machine learning, and choosing the right tool can be as important as working with the best algorithm. Machine learning tools make applied machine learning faster and easier. Good tools can automate each step in the applied machine learning process, shortening the time.

The machine learning tools are as follows:

(i) Platforms: Platforms are used to complete a machine learning project from beginning to end.
(a) Provide capabilities required at each step in a machine learning project.
(b) The interface may be graphical or command line.
(c) They provide a loose coupling of features.
(d) They are provided for general-purpose use and exploration rather than speed, scalability or accuracy.

(ii) Library: A library gives capabilities for completing part of a machine learning project.
(a) Provide a specific capability for one or more steps in a machine learning project.
(b) The interface is typically an application programming interface requiring programming.
(c) They are tailored for a specific use case, problem type or environment.

(iii) Graphical User Interfaces:
(a) Allows less-technical users to work through machine learning.
(b) Focus on process and how to get the most from machine learning techniques.
(c) Stronger focus on graphical presentations of information, such as visualization.
(d) Structured process imposed on the user by the interface.
(iv) Command Line Interface:
(a) Allows technical users who are not programmers to work 
through machine learning projects. 
(b) Frames machine learning tasks in terms of the input required 
and output to be generated. 


(c) Promotes reproducible results by recording or scripting 
commands and command line arguments. 


(v) Application Programming Interfaces — 

(a) To incorporate machine learning into our own software 
projects. 

(b) To create our own machine learning tools. 

(c) Gives the flexibility to use our own processes and 
automations on machine learning projects. 

(d) Allows combining our own methods with those provided 
by the library as well as extending provided methods. 


(vi) Local Tools — Local tools can be downloaded, installed and run 
on local environment. 
(a) Customized for in-memory data and algorithms. 
(b) Control over run configuration and parameterization 
(c) Integrate into our own systems to meet our needs. 


(vii) Remote Tools — Remote tools can be hosted on a server and 
called from local environment. These tools are often referred to as machine 
learning as a service (MLaaS). 

(a) Tailored for scale to be run on larger datasets. 


(b) Run across multiple systems, multiple cores and shared 
memory. 


Q.4. Write short note on artificial intelligence vs machine learning. 

Ans. Artificial intelligence may be broadly defined as machines having the ability to solve a given problem on their own without any human intervention. The solutions are not programmed directly into the system; instead, the necessary data are supplied and the AI, interpreting that data, produces a solution by itself. The interpretation that goes on underneath is nothing but a data mining algorithm.

Machine learning takes this approach to an advanced level by providing the data essential for a machine to train and modify itself suitably when exposed to new data. This is known as "training". It focuses on extracting information from considerably large sets of data, and then detects and identifies underlying patterns using various statistical measures to improve its ability to interpret new data and produce more effective results. Evidently, some parameters should be "tuned" at the incipient level for better productivity.
Machine learning is the foothold of artificial intelligence. It is improbable to design any machine having abilities associated with intelligence, like language or vision, to get there at once. That task would have been almost impossible to solve. Moreover, a system cannot be considered completely intelligent if it lacks the ability to learn and improve from its previous exposures.
Q.5. Write and explain scope of machine learning.

Ans. The scope of machine learning is as follows:

(i) Explaining Human Learning: As mentioned earlier, machine learning theories have been perceived as fitting to comprehend features of learning in humans and animals. Reinforcement learning algorithms estimate the activities induced in dopaminergic neurons of animals during reward-based learning with surprising accuracy. ML algorithms for uncovering sporadic delineations of naturally appearing images predict visual features detected in the initial visual cortex of animals. Nevertheless, the important drivers in human or animal learning, like stimulation, horror, urgency, hunger, instinctive actions and learning by trial and error over numerous time scales, are not yet taken into account in ML algorithms. This is a potential opportunity to discover a more generalised concept of learning that entails both animals and machines.


(ii) Programming Languages Containing Machine Learning Primitives: In the majority of applications, ML algorithms are incorporated with manually coded programs as part of an application software. There is a need for a new programming language that is self-sufficient to support manually written subroutines as well as those defined as "to be learned". Programming languages like Python (Scikit-learn), R etc. are already making use of this concept in a smaller scope. But a fascinating new question is raised as to how to develop a model to define the relevant learning experience for each subroutine tagged as "to be learned", as well as timing and security in case of any unforeseen modification to the program's function.


(iii) Perception: A generalised concept of computer perception that can link the ML algorithms used in numerous forms of computer perception today, including but not limited to highly advanced vision, speech recognition etc., is another potential research area. One thought-provoking problem is the integration of different senses (e.g., sight, hearing, touch) to prepare a system which employs self-supervised learning to estimate one sensory experience using the others. Research in developmental psychology has noted more effective learning in humans when various input modalities are supplied, and studies on co-training methods insinuate similar results.
Q.6. Write the advantages of machine learning.

Ans. The five advantages of machine learning are as follows:

(i) Accurate: Machine learning uses data to discover the optimal decision making engine for your problem. As you collect more data, the accuracy can increase automatically.

(ii) Automated: As answers are validated or discarded, the machine learning model can learn new patterns automatically. This allows users to embed machine learning directly into an automated workflow.

(iii) Fast: Machine learning can generate answers in a matter of milliseconds as new data streams in, allowing systems to react in real time.

(iv) Customizable: Many data-driven problems can be addressed with machine learning. Machine learning models are custom built from your own data, and can be configured to optimize whatever metric drives your business.

(v) Scalable: As your business grows, machine learning easily scales to handle increased data rates. Some machine learning algorithms can scale to handle large amounts of data on many machines in the cloud.


Q.7. Write the disadvantages of machine learning.

Ans. The disadvantages of machine learning are as follows:

(i) Machine learning has a major challenge called data acquisition. Also, depending on the algorithm, the data need to be processed before they are provided as input to the respective algorithm. This processing has a significant impact on the results obtained.

(ii) Interpretation of results is also a major challenge, since it is needed to determine the effectiveness of machine learning algorithms.

(iii) The uses of machine learning algorithms are limited. There is no surety that its algorithms will always work in every case imaginable, and there are many cases in which machine learning fails. Thus, it requires some understanding of the problem at hand to apply the right algorithm.

(iv) Like deep learning algorithms, machine learning also needs a lot of training data, and it can be cumbersome to work with a large amount of data. Fortunately, there is a lot of training data available for image recognition purposes.

(v) One notable limitation of machine learning is its susceptibility to errors. Brynjolfsson and McAfee said that the actual problem with this fact is that when errors are made, diagnosing and correcting them can be difficult, because it requires going through the underlying complexities.
Q.8. Write short note on regression.

Ans. Regression is used to map a data item to a real valued prediction variable. In actuality, regression involves the learning of the function that does this mapping. Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best". Standard linear regression is a simple example of regression.

For example, a college professor wishes to reach a certain level of savings before his retirement. Periodically, he predicts what his retirement savings will be based on its current value and several past values. He uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, he then alters his investment portfolio.

Q.9. Discuss straight-line and multiple linear regression analysis, 


Ans. Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression and models y as a linear function of x. That is,

$y = b + wx$

where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line, respectively. The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write,

$y = w_0 + w_1 x$

These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. Let D be a training set consisting of values of the predictor variable, x, for some population and their associated values for the response variable, y. The training set contains |D| data points of the form $(x_1, y_1), (x_2, y_2), \ldots, (x_{|D|}, y_{|D|})$. The regression coefficients can be estimated using this method with the following equations:

$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$

$w_0 = \bar{y} - w_1 \bar{x}$

where $\bar{x}$ is the mean value of $x_1, x_2, \ldots, x_{|D|}$ and $\bar{y}$ is the mean value of $y_1, y_2, \ldots, y_{|D|}$. The coefficients $w_0$ and $w_1$ often provide good approximations to otherwise complicated regression equations.
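As a minimal sketch of the least-squares estimates for a straight line, the following Python snippet computes the intercept and slope from the means of x and y. The function name `fit_line` and the small training set are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (w0, w1) for the line y = w0 + w1 * x fitted by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical training set generated from y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

Because the sample points lie exactly on a line, the fitted intercept and slope recover the generating values 1 and 2.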
Multiple linear regression is an extension of straight-line regression so as to involve more than one predictor variable. It allows the response variable y to be modeled as a linear function of, say, n predictor variables or attributes, $A_1, A_2, \ldots, A_n$, describing a tuple, X. Our training data set, D, contains data of the form $(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})$, where the $X_i$ are the n-dimensional training tuples with associated class labels, $y_i$. An example of a multiple linear regression model based on two predictor attributes or variables, $A_1$ and $A_2$, is

$y = w_0 + w_1 x_1 + w_2 x_2$

where $x_1$ and $x_2$ are the values of attributes $A_1$ and $A_2$, respectively, in X.
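The two-predictor model can be fitted by least squares as well. The sketch below is one possible approach, not the only one: it builds the normal equations and solves them with a small Gaussian elimination routine. The helper names `solve` and `fit_multiple` and the data are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Return least-squares weights [w0, w1, ..., wn] for y = w0 + w1*x1 + ... + wn*xn."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept w0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical tuples (x1, x2) with responses generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
weights = fit_multiple(rows, ys)
print(weights)
```

In practice a numerical library would normally be used for this step; the hand-written solver only keeps the sketch self-contained.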
Q.10. Explain regression and log-linear models.

Ans. Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation

$y = wx + b$

where the variance of y is assumed to be constant. In data mining, x and y are numerical database attributes. The coefficients, w and b, are called regression coefficients; they specify the slope of the line and the Y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line. Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.


Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction and data smoothing.

Regression and log-linear models can both be used on sparse data, although their application may be limited. While both methods can handle skewed data, regression does exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
Q.11. Illustrate the simple linear regression and multiple linear regression derivation.

Ans. In linear regression, the model specification is that the dependent variable, $y_i$, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling n data points there is one independent variable, $x_i$, and two parameters, $\beta_0$ and $\beta_1$:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, 3, \ldots, n.$

In multiple linear regression, there are several independent variables or functions of independent variables.

Adding a term in $x_i^2$ to the preceding regression gives:

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \quad i = 1, 2, 3, \ldots, n.$

This is still linear regression; although the expression on the right hand side is quadratic in the independent variable $x_i$, it is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$.

In both cases, $\varepsilon_i$ is an error term and the subscript i indexes a particular observation.

Returning our attention to the straight line case: given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

The residual, $e_i = y_i - \hat{y}_i$, is the difference between the value of the dependent variable predicted by the model, $\hat{y}_i$, and the true value of the dependent variable, $y_i$. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE, also sometimes denoted RSS:

$SSE = \sum_{i=1}^{n} e_i^2$

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$.

In the case of simple regression, the formulas for the least squares estimates are:

$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x}$ is the mean (average) of the x values and $\bar{y}$ is the mean of the y values.

Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

$\hat{\sigma}_{\varepsilon}^2 = \frac{SSE}{n - 2}$

This is called the mean square error (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n - p) for p regressors or (n - p - 1) if an intercept is used. In this case p = 1, so the denominator is n - 2.

The standard errors of the parameter estimates are given by

$\hat{\sigma}_{\beta_1} = \hat{\sigma}_{\varepsilon} \sqrt{\frac{1}{\sum (x_i - \bar{x})^2}}$

$\hat{\sigma}_{\beta_0} = \hat{\sigma}_{\varepsilon} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$

Under the further assumption that the population error term is normally 
distributed, the researcher can use these estimated standard errors to create 


confidence intervals and conduct hypothesis tests about the population 
parameters. 
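The estimates above can be computed directly. The following sketch assumes the usual textbook expressions for simple regression, namely MSE = SSE/(n - 2), SE(slope) = sqrt(MSE / Sxx) and SE(intercept) = sqrt(MSE (1/n + x̄² / Sxx)), where Sxx is the sum of squared deviations of x from its mean. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean


def simple_regression_summary(xs, ys):
    """Least-squares fit of y = b0 + b1*x with SSE, MSE and standard errors."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals between observed and fitted values.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)  # two parameters were estimated from the data
    return {
        "b0": b0,
        "b1": b1,
        "sse": sse,
        "mse": mse,
        "se_b1": sqrt(mse / sxx),
        "se_b0": sqrt(mse * (1 / n + x_bar ** 2 / sxx)),
    }


# Hypothetical sample of five observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
summary = simple_regression_summary(xs, ys)
print(summary)
```

A confidence interval for the slope would then take the form b1 plus or minus a t-table value times `se_b1`, with n - 2 degrees of freedom.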


Fig. 1.1 Illustration of Linear Regression on a Data Set
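Q.12. Explain analysis of variance for simple linear regression.

Ans. When there is no association between Y and X ($\beta_1 = 0$), the best predictor of each observation is $\bar{Y} = \hat{\beta}_0$ (in terms of minimizing the sum of squares of prediction errors). In this case, the total variation can be denoted as

$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$, the total sum of squares.

When there is an association between Y and X ($\beta_1 \neq 0$), the best predictor of each observation is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ (in terms of minimizing the sum of squares of prediction errors). In this case, the error variation can be denoted as

$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, the error sum of squares.

The difference between TSS and SSE is the variation "explained" by the regression of Y on X (as opposed to having ignored X). It represents the difference between the fitted values and the mean:

$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, the regression sum of squares.

TSS = SSE + SSR

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$

Each sum of squares has a degrees of freedom associated with it. The total degrees of freedom is $df_{Total} = n - 1$. The error degrees of freedom is $df_{Error} = n - 2$ (for simple regression). The regression degrees of freedom is $df_{Regression} = 1$ (for simple regression).

$df_{Total} = df_{Error} + df_{Regression}$
n - 1 = (n - 2) + 1

Error and regression sums of squares have a mean square, which is the sum of squares divided by its corresponding degrees of freedom: MSE = SSE/(n - 2) and MSR = SSR/1. It can be shown that these mean squares have the following expected values, average values in repeated sampling at the same observed X levels:

$E\{MSE\} = \sigma^2, \quad E\{MSR\} = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$

Table 1.1 Analysis of Variance for Simple Linear Regression
Source | df | SS | MS | F
Regression | 1 | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | MSR = SSR/1 | F = MSR/MSE
Error | n - 2 | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | MSE = SSE/(n - 2) |
Total | n - 1 | $TSS = \sum (Y_i - \bar{Y})^2$ | |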
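The decomposition TSS = SSE + SSR can be checked numerically. The snippet below is a small sketch on hypothetical data: it fits the straight line by least squares, computes the three sums of squares, and forms the ratio F = MSR/MSE with 1 and n - 2 degrees of freedom. The function name `anova_simple` is an assumption for illustration.

```python
from statistics import mean


def anova_simple(xs, ys):
    """Return (TSS, SSE, SSR, F) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    tss = sum((y - y_bar) ** 2 for y in ys)               # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained variation
    msr = ssr / 1          # regression degrees of freedom = 1
    mse = sse / (n - 2)    # error degrees of freedom = n - 2
    return tss, sse, ssr, msr / mse


# Hypothetical sample of five observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
tss, sse, ssr, f_stat = anova_simple(xs, ys)
print(tss, sse, ssr, f_stat)
```

The F value would then be compared with an F-table entry for (1, n - 2) degrees of freedom to judge whether the slope differs from zero.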
Q.13. Explain analysis of variance for multiple linear regression.

Ans. When there is no association between Y and $X_1, \ldots, X_p$ ($\beta_1 = \cdots = \beta_p = 0$), the best predictor of each observation is $\bar{Y} = \hat{\beta}_0$ (in terms of minimizing the sum of squares of prediction errors). In this case, the total variation can be denoted as $TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$, the total sum of squares, just as with simple regression.

When there is an association between Y and at least one of $X_1, \ldots, X_p$ (not all $\beta_j = 0$), the best predictor of each observation is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}$ (in terms of minimizing the sum of squares of prediction errors). In this case, the error variation can be denoted as $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, the error sum of squares.

The difference between TSS and SSE is the variation "explained" by the regression of Y on $X_1, \ldots, X_p$ (as opposed to having ignored $X_1, \ldots, X_p$). It represents the difference between the fitted values and the mean: $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, the regression sum of squares.

Each sum of squares has a degrees of freedom associated with it. The total degrees of freedom is $df_{Total} = n - 1$. The error degrees of freedom is $df_{Error} = n - p'$, where $p' = p + 1$ is the number of estimated parameters including the intercept. The regression degrees of freedom is $df_{Regression} = p$. Note that when we have p = 1 predictor, this reduces to simple regression.

$df_{Total} = df_{Error} + df_{Regression}$
n - 1 = (n - p') + p

Error and regression sums of squares have a mean square, which is the sum of squares divided by its corresponding degrees of freedom: MSE = SSE/(n - p') and MSR = SSR/p. It can be shown that these mean squares have the following expected values, average values in repeated sampling at the same observed X levels:

$E\{MSE\} = \sigma^2, \quad E\{MSR\} \geq \sigma^2$

Table 1.2 Analysis of Variance for Multiple Linear Regression
Source | df | SS | MS | F | P-value
Regression (Model) | p | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | MSR = SSR/p | $F_{obs}$ = MSR/MSE | $P(F_{p,\,n-p'} \geq F_{obs})$
Error (Residual) | n - p' | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | MSE = SSE/(n - p') | |
Total (Corrected) | n - 1 | $TSS = \sum (Y_i - \bar{Y})^2$ | | |
Q.14. What is the concept of probability ? Explain.

Ans. The concept of probability is extremely important. It has a very extensive application in the development of all physical sciences. The chance of happening of an event, when expressed quantitatively, is called probability.

Theorem of Total Probability or Addition Law of Probability

If the probability of an event A happening as a result of a trial is P(A) and the probability of a mutually exclusive event B happening is P(B), then the probability of either of the events happening as a result of the trial is

$P(A + B)$ or $P(A \cup B) = P(A) + P(B)$

Proof. Let n be the total number of equally likely cases, and let $m_1$ be favourable to the event A and $m_2$ be favourable to the event B. Then the number of cases favourable to A or B is $m_1 + m_2$.

Hence the probability of A or B happening as a result of the trial

$= \frac{m_1 + m_2}{n} = \frac{m_1}{n} + \frac{m_2}{n} = P(A) + P(B)$

If the events A and B are not mutually exclusive, then there are some outcomes which favour both A and B. If $m_3$ be their number, then these are included in both $m_1$ and $m_2$.

Hence the total number of outcomes favouring either A or B or both is $m_1 + m_2 - m_3$.

Thus the probability $P(A + B)$ or $P(A \cup B)$ of occurrence of A or B or both

$= \frac{m_1 + m_2 - m_3}{n} = \frac{m_1}{n} + \frac{m_2}{n} - \frac{m_3}{n}$

or $P(A + B) = P(A) + P(B) - P(AB)$

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

When A and B are mutually exclusive, $P(A \cap B)$ or $P(AB) = 0$ and we have

$P(A + B)$ or $P(A \cup B) = P(A) + P(B)$
Particular Cases:

(i) If A and B are defined on the sample space S and $A \cap B = \phi$, then
$n(A \cup B) = n(A) + n(B)$
or $P(A \cup B) = P(A) + P(B)$

(ii) Since S and $\phi$ are mutually exclusive events and $S \cup \phi = S$,
$P(S \cup \phi) = P(S)$
$P(S) + P(\phi) = P(S)$
$P(\phi) = 0$
(iii) Let A and $\bar{A}$ be complementary events. Then A and $\bar{A}$ are mutually exclusive by definition. Hence
$P(A \cup \bar{A}) = P(S)$
$P(A) + P(\bar{A}) = 1$  [since P(S) = 1]
$P(\bar{A}) = 1 - P(A)$
$P(A) = 1 - P(\bar{A})$

(iv) We know that
$A = (A \cap B) \cup (A \cap \bar{B})$
That is, A is the union of two mutually exclusive events, so
$P(A) = P(A \cap B) + P(A \cap \bar{B})$
or $P(A \cap \bar{B}) = P(A) - P(A \cap B)$
Similarly,
$P(\bar{A} \cap B) = P(B) - P(A \cap B)$

(v) If $B \subset A$, then
(a) $P(A \cap \bar{B}) = P(A) - P(B)$
(b) $P(B) \leq P(A)$.
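The addition law and the complement rule can be verified on a small sample space of equally likely outcomes. The sketch below uses one roll of a fair die; the events chosen and the helper `prob` are hypothetical illustrations.

```python
from fractions import Fraction

# Sample space S: one roll of a fair six-sided die (equally likely outcomes).
sample_space = set(range(1, 7))


def prob(event):
    """Probability of an event as (favourable cases) / (total cases)."""
    return Fraction(len(event & sample_space), len(sample_space))


a = {2, 4, 6}  # event A: the roll is even
b = {4, 5, 6}  # event B: the roll is greater than 3 (not mutually exclusive with A)

lhs = prob(a | b)                          # P(A U B)
rhs = prob(a) + prob(b) - prob(a & b)      # P(A) + P(B) - P(A n B)
complement = sample_space - a              # complementary event of A

print(lhs, rhs, prob(complement))
```

Using `Fraction` keeps the arithmetic exact, so both sides of the addition law can be compared with `==` rather than with a tolerance.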
Q.15. Define the following:
(i) Probability function (ii) Probability mass function. (R.G.P.V., June 2016)

Fig. 1.2

(ii) Probability Mass Function: In probability theory and statistics, a probability mass function (p.m.f.) is a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables, given that the distribution is discrete.
A probability mass function differs from a probability density function (p.d.f.) in that the latter is associated with continuous rather than discrete random variables; the values of the latter are not probabilities as such: a p.d.f. must be integrated over an interval to yield a probability.

Q.16. Write the expected value (or mean) and variance of a discrete probability distribution with example.

Ans. The probability distribution for a random variable is a model for the relative frequency distribution of a population (think of organizing a very large number of observations of a discrete random variable into a relative frequency distribution; the probability distribution would closely approximate this). The analogous descriptive measures of the mean and variance are therefore important concepts.

The expected value (or mean), denoted by E(X) or $\mu$, of a discrete random variable X with probability function p(x) is defined as follows:

$E(X) = \sum x \, p(x)$

where the summation extends over all possible values x.

Note the similarity to the mean $\frac{1}{n} \sum f x$ (where $n = \sum f$) of a frequency distribution in which each value, x, is weighted by its frequency f. For a probability distribution, the probability p(x) replaces the observed relative frequency f/n.

The variance of a random variable X, denoted by Var(X) or $\sigma^2$, is defined as the expected value of the quantity $(X - \mu)^2$, where $\mu$ is the mean of X, i.e.

$Var(X) = E[(X - \mu)^2] = \sum (x - \mu)^2 \, p(x)$

The standard deviation of X is the square root, $\sigma$, of the variance.

For Example:

Consider an experiment that consists of tossing a fair coin three times. Let X equal the number of heads observed. Describe the probability distribution of X and hence determine E(X) and Var(X).

The eight possible outcomes here are HHH, HHT, HTH, THH, HTT, THT, TTH, TTT. This gives the probability distribution as follows:

x | Outcomes | Probability
0 | TTT | 1/8
1 | HTT, THT, TTH | 3/8
2 | HHT, HTH, THH | 3/8
3 | HHH | 1/8
Total Probability | | 1.0
The expected value of X is:
E(X) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8)
= 0 + 3/8 + 6/8 + 3/8
= 12/8
= 1.5
The variance of X is:
Var(X) = (0 - 1.5)²(1/8) + (1 - 1.5)²(3/8) + (2 - 1.5)²(3/8) + (3 - 1.5)²(1/8)
= 2.25(1/8) + 0.25(3/8) + 0.25(3/8) + 2.25(1/8)
= (2.25 + 0.75 + 0.75 + 2.25)/8
= 0.75
There is an alternative way of computing Var(X). It can be shown that
$Var(X) = \sum x^2 p(x) - [\sum x \, p(x)]^2$. Using this result we have the table 1.3.

Table 1.3
x | p(x) | x p(x) | x² p(x)
0 | 1/8 | 0 | 0
1 | 3/8 | 3/8 | 3/8
2 | 3/8 | 6/8 | 12/8
3 | 1/8 | 3/8 | 9/8
Total | 1.0 | 1.5 | 3.0

Var(X) = 3.0 - 1.5² = 3.0 - 2.25 = 0.75.
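The coin-tossing example can be reproduced by enumerating the eight outcomes. The following sketch, with hypothetical variable names, builds the probability distribution of the number of heads and computes the mean and the variance both from the definition and from the alternative formula.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of three tosses of a fair coin.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give each number of heads, then convert to probabilities.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

mean_x = sum(x * p for x, p in pmf.items())                       # E(X)
var_definition = sum((x - mean_x) ** 2 * p for x, p in pmf.items())
var_alternative = sum(x ** 2 * p for x, p in pmf.items()) - mean_x ** 2

print(sorted(pmf.items()), mean_x, var_definition, var_alternative)
```

Both variance formulas return the same exact fraction, which matches the worked values E(X) = 1.5 and Var(X) = 0.75.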


Q.17. What do you mean by probability distribution and discrete probability distribution ?

Ans. Probability Distribution: We have explored the idea of probability; we now consider the concept of a probability distribution. In situations where the variable being studied is a random variable, it can often be modelled by a probability distribution. Simply put, a probability distribution has two components: the collection of possible values that the variable can take, together with the probability that each of these values (or a subset of these values) occurs.

Now, in deterministic modelling (in which there is no randomness) there are many different functions available for modelling, and some of these are used more often than others, e.g. linear functions, polynomial functions and exponential functions. Exactly the same situation
prevails in stochastic modelling. Certain probability distributions are used more often than others. These include the Binomial distribution, the Poisson distribution, the uniform distribution, the normal distribution and the negative exponential distribution.

Discrete Probability Distributions: A discrete random variable assumes each of its values with a certain probability, i.e. each possible value of the random variable has an associated probability. Let X be a discrete random variable and let each value of the random variable have an associated probability, denoted p(x) = P(X = x), such that

X: | $x_1$ | $x_2$ | ... | $x_n$
p(x): | $p_1$ | $p_2$ | ... | $p_n$

The function p(x) is known as the probability distribution of the random variable X if the following conditions are satisfied:

(i) $p(x) \geq 0$ for all values x of X
(ii) $\sum_{x} p(x) = 1$

p(x) is also referred to as the probability function or probability mass function.
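The two conditions can be checked mechanically for any candidate table of probabilities. As a sketch, the snippet below builds a Binomial distribution (one of the commonly used distributions named above) and tests both conditions. The function names and the parameters n = 4, p = 0.3 are assumptions for illustration.

```python
from math import comb, isclose


def binomial_pmf(n, p):
    """Probability of k successes in n independent trials, for k = 0..n."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}


def is_valid_pmf(pmf, tol=1e-9):
    """Check condition (i) p(x) >= 0 and condition (ii) sum of p(x) equals 1."""
    non_negative = all(value >= 0 for value in pmf.values())
    sums_to_one = isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return non_negative and sums_to_one


pmf = binomial_pmf(4, 0.3)
print(pmf, is_valid_pmf(pmf))
print(is_valid_pmf({0: 0.5, 1: 0.6}))  # probabilities add up to 1.1, so not a valid p.m.f.
```

A tolerance is used for condition (ii) because floating-point sums are rarely exactly 1.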


Q.18. What do you mean by probability density function ? 
[R.GP.V., May 2019 (IL-Sem\] 
Ans. For a discrete random variable, probabilities are associated with particular individual values of the random variable and the sum of all probabilities is one. For a continuous random variable X, we do not have a formula which gives the probability of any particular value of X. The probability that a continuous random variable X assumes a specific value x is taken to be zero. For a continuous random variable we deal with probabilities of intervals rather than probabilities of particular individual values.

The probability distribution of a continuous random variable X is characterized by a function f(x) known as the probability density function. This function is not the same as the probability function in the discrete case. Since the probability that X is equal to a specific value is zero, the probability density function does not represent the probability that X = x. Rather, it provides a means by which the probability of an interval can be determined. Such a function can be defined for a continuous random variable X, provided that it is scaled in such a way that the total area under the curve is one.

Let X be a continuous random variable. The function f(x) is said to be a probability density function of X if:
(i) $f(x) \geq 0$, for every value of x

(ii) $\int_{-\infty}^{\infty} f(x) \, dx = 1$

(iii) $P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx$ for any a and b.

Note: The third condition indicates how the function is used.

The probability that X will assume some value in the interval [a, b] equals the area under the curve between x = a and x = b. This is the shaded area as shown in fig. 1.3.


Fig. 1.3 (vertical axis: probability density function; horizontal axis: x; shaded area between a and b: $P(a \leq X \leq b)$)


In general the exact evaluation of areas requires us to use the probability 
density function f(x) and integral calculus. These calculations are time 
consuming and not straightforward. However, probability values can be 
obtained from statistical tables (just as for discrete probability distributions). 
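When tables are not at hand, the area under a density curve can also be approximated numerically. The sketch below assumes the standard normal density as the example f(x) and uses the composite trapezoidal rule; the function names are hypothetical. The closed form based on `math.erf` is included only as a cross-check for this particular density.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x):
    """Standard normal probability density function."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h


p_interval = integrate(normal_pdf, -1.0, 1.0)   # P(-1 <= X <= 1), about 0.6827
exact = erf(1 / sqrt(2))                        # closed form for the same probability
total_area = integrate(normal_pdf, -10.0, 10.0) # should be very close to 1

print(p_interval, exact, total_area)
```

The interval from -10 to 10 stands in for the whole real line, since the standard normal density is negligible beyond it.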


Q.19. Write short note on statistics.

Ans. Every statistical calculation involving the ratio of size of difference (numerator) to error (denominator), illustrated in the table, produces a number termed a statistic; t, F and $\chi^2$ are statistics. This value becomes larger as the ratio of difference (or relationship) to error increases, as the formulae indicate. A larger statistic obviously is better.

The statistic is compared to a table of statistics to determine if it is large enough to reject the null hypothesis. As the sample size increases, the statistic need not be as large to reject the null hypothesis. The reason for this is that larger samples give more confidence that the obtained sample is not unrepresentative of the population, that is, not a "fluke". Small samples are notoriously unreliable, because every small sample that is drawn from a population will likely be distinctly different due to chance. Therefore, to be confident that the ratio is sufficiently large to reject the null hypothesis when the sample is small, the statistic must be larger.

In a statistics course, we would learn the several formulae used to calculate the numbers we need in order to determine if we can reject the null hypothesis.
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of these formulae are presented here to point out their simian 


A few ee 
Conceptual, not actual, formulae are used. The different types of Formal 
soe a, 
shown in table 1.4. t 
Table 1.4
Name of Test | Formula (Conceptual) | What is Done
t-test | t = Mean difference / Error variance | Difference between two means
Analysis of variance | F = Differences among means / Error variance within groups | Differences among means
Chi-square test | χ² = Extent to which frequencies are not consistent with the null hypothesis / Size of sample | Differences in frequencies; or relationship between nominal variables


Q.20. Explain the different types of statistics and compare them with
example.


Ans. There are two major types of statistics — 
(i) Descriptive statistics 


(ii) Inferential statistics.


(i) Descriptive Statistics - The branch of statistics devoted to the
summarization and description of data is called descriptive statistics.
Descriptive statistics consists of methods for organizing and summarizing
information.

Descriptive statistics includes the construction of graphs, charts, and tables
and the calculation of various descriptive measures such as averages, measures
of variation, and percentiles. In fact, the most part of this course deals with
descriptive statistics.


(ii) Inferential Statistics - The branch of statistics concerned with
using sample data to make an inference about a population of data is called
inferential statistics. Inferential statistics consists of methods for drawing and
measuring the reliability of conclusions about a population based on information
obtained from a sample of the population.

Inferential statistics includes methods like point estimation, interval
estimation and hypothesis testing, which are all based on probability theory.


Example (Descriptive and Inferential Statistics) - Consider the event of
tossing dice. The dice is rolled 100 times and the results form the sample
data. Descriptive statistics is used to group the sample data into the following
table 1.5.

Table 1.5
Outcome of the Roll | Frequencies in the Sample Data
1 | 10
2 | 20
3 | 18
4 | 16
5 | 11
6 | 25


Inferential statistics can now be used to verify whether the dice is fair or
not.
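One way to carry out this check is a chi-square goodness-of-fit test on the frequencies of table 1.5. The sketch below is an illustration, not a method prescribed by the text: it assumes a fair die gives each face an expected frequency of 100/6, and it compares the statistic with 11.07, the tabulated 5 % critical value of the chi-square distribution with 5 degrees of freedom. The function name `chi_square_fair_die` is hypothetical.

```python
# Observed frequencies from Table 1.5 (100 rolls).
observed = {1: 10, 2: 20, 3: 18, 4: 16, 5: 11, 6: 25}


def chi_square_fair_die(freqs):
    """Chi-square statistic against equal expected frequencies."""
    n = sum(freqs.values())
    expected = n / len(freqs)  # a fair die: every face equally likely
    return sum((o - expected) ** 2 / expected for o in freqs.values())


chi2 = chi_square_fair_die(observed)

# Tabulated critical value: chi-square, 5 degrees of freedom, 5 % level.
CRITICAL_5_PERCENT_DF5 = 11.07
reject_fair = chi2 > CRITICAL_5_PERCENT_DF5

print(f"chi-square statistic = {chi2:.2f}")
print("Reject the hypothesis of a fair die:", reject_fair)
```

With these frequencies the statistic stays below the critical value, so at the 5 % level the sample does not give enough evidence to call the dice unfair.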

Descriptive and inferential statistics are interrelated. It is almost always
necessary to use methods of descriptive statistics to organize and summarize
the information obtained from a sample before methods of inferential statistics
can be used to make more thorough analysis of the subject under investigation.
Furthermore, the preliminary descriptive analysis of a sample often reveals
features that lead to the choice of the appropriate inferential method to be later
used.

Sometimes it is possible to collect the data from the whole population. In
that case it is possible to perform a descriptive study on the population as well
as usually on the sample. Only when an inference is made about the population
based on information obtained from the sample does the study become
inferential.


Q.21. Discuss inferential statistics with example. 

Ans. Inferential statistics are used to make judgements about the probability
that an observed difference between groups is a dependable one or one that
might happen by chance. With inferential statistics, one reaches conclusions
that extend beyond the immediate data alone. Thus, one uses inferential statistics
to make inferences from our data to more general conditions. Perhaps one of
the simplest inferential tests is used when one has to compare the average
performance of two groups on a single measure to see if there is a difference.

Whenever one wishes to compare the average performance between two
groups one should consider the t-test for difference between groups.
Most of the major inferential statistics come from a general family of
statistical models known as the general linear model. This includes the t-test,
analysis of variance (ANOVA), analysis of covariance (ANCOVA), regression
analysis, and many of the multivariate methods like factor analysis,
multidimensional scaling, cluster analysis, discriminant function analysis, and
so on. In each case one estimates population parameters from observing the
sample values.


Inferential statistics provide a way of going from a "sample" to a "population",
inferring the "parameters" of a population from data on the "statistics"
of a sample, i.e., parameters like μ and σ, from statistics like m and s.


But before we can see what is involved in the move from sample to
population we need to understand how to move from population to sample.
The study of obtaining a sample from a population is "probability".


[Figure: Probability leads from the population to the sample; inferential statistics leads from the sample back to the population.]


For Example - The probability of picking a black ball from jar A is one
half; the probability of picking a black ball from jar B is one tenth.

Jar A: 50 black, 50 white
Jar B: 10 black, 90 white


Q.22. Explain the term statistical analysis of data.(R.GP.V., May 201 g 


Ans. Statistical data analysis is a procedure of performing various statistical
operations. It is a kind of quantitative research, which seeks to quantify the
data, and typically applies some form of statistical analysis. Quantitative data
basically involves descriptive data, such as survey data and observational data.


Statistical data analysis generally involves some form of statistical tools,
which a layman cannot perform without having any statistical knowledge. There
are various software packages to perform statistical data analysis. This software
includes Statistical Analysis System (SAS), Statistical Package for the Social
Sciences (SPSS), etc.

Data in statistical data analysis consists of variable(s). Sometimes the data
is univariate or multivariate. Depending upon the number of variables, the
researcher performs different statistical techniques.

If the data in statistical data analysis is multiple in numbers, then several
multivariates can be performed. These are factor statistical data analysis,
discriminant statistical data analysis, etc. Similarly, if the data is singular
in number, then the univariate statistical data analysis is performed. This includes
the t-test, z-test and f-test.

The data in statistical data analysis is basically of two types, namely,
continuous data and discrete data. The continuous data is the one that cannot
be counted. For example, the intensity of a light can be measured but cannot be
counted. The discrete data is the one that can be counted. For example, the
number of bulbs can be counted.

The continuous data in statistical data analysis is distributed under a
continuous distribution function, which can also be called the probability density
function, or simply pdf.
The discrete data in statistical data analysis is distributed under a discrete
distribution function, which can also be called the probability mass function
or simply pmf.

We use the word 'density' in continuous data of statistical data analysis
because density cannot be counted, but can be measured. We use the word 'mass'
in discrete data of statistical data analysis because mass cannot be counted.


There are various pdf's and pmf's in statistical data analysis. For example,
the Poisson distribution is a commonly known pmf, and the normal distribution is
a commonly known pdf.

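These distributions in statistical data analysis help us to understand which
data falls under which distribution. If the data is a count, such as the number
of bulbs failing in a fixed period, it is often modelled by the Poisson
distribution; a measured quantity, such as the intensity of a bulb, calls for a
continuous distribution instead.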
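The difference between a pmf and a pdf can be made concrete with the two distributions named above. The following is a minimal sketch under assumptions: the rate 3.0 and the standard deviation 0.1 are arbitrary illustrative values, and the helper names `poisson_pmf` and `normal_pdf` are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam (a pmf)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def normal_pdf(x, mu, sigma):
    """Density at x of a normal distribution (a pdf)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# A pmf assigns probabilities to countable values, and they sum to one.
pmf_total = sum(poisson_pmf(k, 3.0) for k in range(50))

# A pdf value is a density, not a probability: it may exceed one,
# because only the area under the curve must equal one.
peak_density = normal_pdf(0.0, mu=0.0, sigma=0.1)

print(f"Sum of Poisson pmf values: {pmf_total:.6f}")
print(f"Normal density at its peak (sigma = 0.1): {peak_density:.3f}")
```

The second printed value being larger than one is the practical reason for speaking of "density" rather than "probability" in the continuous case.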

There is a major task in statistical data analysis, which is statistical
inference. Statistical inference mainly comprises two parts:
estimation and tests of hypothesis. Estimation in statistical data analysis
mainly involves parametric data, the data that consists of parameters. On the
other hand, tests of hypothesis in statistical data analysis mainly involve
non-parametric data, the data that consists of no parameters.

Traditional methods for statistical analysis, from sampling data to
interpreting results, have been used by scientists for thousands of years. But
today's data volumes make statistics ever more valuable and powerful.
Affordable storage, powerful computers and advanced algorithms have all led
to an increased use of computational statistics.

Whether we are working with large data volumes or running multiple
permutations of our calculations, statistical computing has become essential
for today's statistician. Popular statistical computing practices include -


(i) Statistical Programming - From traditional analysis of variance
and linear regression to exact methods and statistical visualization techniques,
statistical programming is essential for making data-based decisions in every
field.

(ii) Econometrics - Modeling, forecasting and simulating business
processes for improved strategic and tactical planning. This method applies
statistics to economics to forecast future trends.

(iii) Operations Research - Identify the actions that will produce
the best results, based on many possible options and outcomes. Scheduling,
simulation and related modeling processes are used to optimize business
processes and management challenges.

(iv) Matrix Programming - Powerful computer techniques for
implementing your own statistical methods and exploratory data analysis using
row operation algorithms.

(v) Statistical Visualization - Fast, interactive statistical analysis
and exploratory capabilities in a visual interface can be used to understand
data and build models.

(vi) Statistical Quality Improvement - A mathematical approach to
reviewing the quality and safety characteristics for all aspects of production.


Q.23. What is the function of F test ? Explain. 


Ans. An F test checks whether the simple linear regression model "explains"
(really, predicts) a "significant" amount of the variance in the response. What
this really does is compare two versions of the simple linear regression model.
The null hypothesis is that all of the assumptions of that model hold, and the
slope, $\beta_1$, is exactly 0. This is sometimes called the "intercept-only" model,
for obvious reasons. The alternative is that all of the simple linear regression
assumptions hold with $\beta_1 \in \mathbb{R}$. The alternative, non-zero-slope model will
always fit the data better than the null, intercept-only model; the F test asks if
the improvement in fit is larger than we would expect under the null.
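The comparison of the two models can be sketched in a few lines. This is an illustration under assumptions, not the text's own procedure: the data set is made up, the function name `f_test_simple_regression` is hypothetical, and the statistic is the usual ratio of the improvement in the sum of squared errors (1 degree of freedom) to the residual mean square of the sloped model (n - 2 degrees of freedom).

```python
def f_test_simple_regression(x, y):
    """Return (slope, intercept, F) comparing intercept-only vs. sloped fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Null model: predict the mean everywhere (slope fixed at 0).
    sse_null = sum((yi - y_bar) ** 2 for yi in y)
    # Alternative model: least-squares line with a free slope.
    sse_alt = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    f_stat = ((sse_null - sse_alt) / 1) / (sse_alt / (n - 2))
    return slope, intercept, f_stat


# Made-up illustrative data, roughly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, f_stat = f_test_simple_regression(x, y)

# Tabulated 5 % critical value of F with (1, 3) degrees of freedom.
F_CRITICAL_1_3 = 10.13
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, F = {f_stat:.1f}")
print("Improvement larger than expected under the null:", f_stat > F_CRITICAL_1_3)
```

A large F here only says that the sloped line predicts better than the flat line; it says nothing about whether the linear model itself is appropriate.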


There are situations where it is useful to know about this precise quantity,
and so to run an F test on the regression. It is hardly ever, however, a good way to
check whether the simple linear regression model is correctly specified, because
neither retaining nor rejecting the null gives us information about what we
really want to know.

Suppose first that we retain the null hypothesis, i.e., we do not find any
significant share of variance associated with the regression. This could be
because the intercept-only model is right, or because the slope is not zero but
the test lacks the power to detect it.

Suppose instead that we reject the null, intercept-only hypothesis. This does
not mean that the simple linear model is right. It means that the sloped model
predicts better than the intercept-only model, too much better to be due to
chance. The simple linear regression model can be absolute garbage, with every
single one of its assumptions flagrantly violated, and yet better
than the model which makes all those assumptions and thinks the optimal
slope is zero.

Neither the F test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$ nor the Wald test of the same
hypothesis tells us anything about the correctness of the simple linear regression
model. All these tests presume the simple linear regression model with Gaussian
noise is true, and check a special case (flat line) against the general one (tilted
line). They do not test linearity, constant variance, lack of correlation, or
Gaussianity.

Q.24. Write short note on t-test. 

Ans. A t-test is used to compare the mean scores obtained by two groups
on a single variable. The critical ratio test or t-test is used for the two-sample
difference of means. Here it is applied to determine the differences between
means of two scores obtained from the one group based on the two variables.
It is very useful when the population variance is not known and when the
sample size is small. The formula for estimating the ratio following the ANOVA
test is

$$t = \frac{|M_1 - M_2|}{\sqrt{\dfrac{\sigma_1^2}{N_1} + \dfrac{\sigma_2^2}{N_2}}}$$

Where, $M_1$ = Mean of the first sample
$M_2$ = Mean of the second sample
$\sigma_1$ = Standard deviation of the first sample
$\sigma_2$ = Standard deviation of the second sample
$N_1$ = Sample size of the first sample
$N_2$ = Sample size of the second sample.

Interpretation of t-ratio - If the calculated t is less than the tabulated
value of t at the 0.05 or 0.01 level then the null hypothesis is accepted. If the
calculated t is greater than the tabulated t at the 0.05 or 0.01 level then the null
hypothesis is rejected. In the present study, if the ANOVA value is significant,
the hypothesis is further subjected to the t-test.
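The t-ratio and its interpretation can be sketched as follows. All numbers are made up for illustration, the function name `t_ratio` is hypothetical, and the tabulated value 2.00 is only an approximate two-tailed 0.05 value for about 58 degrees of freedom; in practice it should be looked up in a t table for the actual degrees of freedom.

```python
import math


def t_ratio(m1, m2, sd1, sd2, n1, n2):
    """t = |M1 - M2| / sqrt(sd1^2 / N1 + sd2^2 / N2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return abs(m1 - m2) / standard_error


# Illustrative summary statistics for two samples (assumed values).
t = t_ratio(m1=52.0, m2=48.0, sd1=6.0, sd2=5.0, n1=30, n2=30)

TABULATED_T_005 = 2.00  # approximate table value, see the note above
if t > TABULATED_T_005:
    decision = "reject the null hypothesis"
else:
    decision = "accept the null hypothesis"

print(f"calculated t = {t:.3f}; decision at the 0.05 level: {decision}")
```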


Q.25. Describe the scope of statistical method.

Ans. (i) In Government and Public Sectors - Planning of any kind, to be
successful, must be based on statistics. Statistical methods are required to deal
with immediate problems such as unemployment and protective tariffs. An
estimate of the revenue and expenditure for the ensuing year is necessary for
the successful running of the Government machinery.

(ii) In Business and Commerce - A manufacturer must make a study of the
seasonal changes in the demand for goods, and the rates of interest for
borrowing. A manufacturer of goods such as cloth must know the sizes and
designs which are most in demand. Railways must decide when to run special
trains and when to curtail services. Insurance companies, in deciding upon the
premium to be charged, have to consider the mortality or sickness rates
experienced, the rate of interest likely to be earned, and the like.


(iii) In Medical Science - Statistical methods are necessary in finding
the effectiveness of medicines and drugs for the prevention and cure of disease.

(iv) In Agricultural Research - Much ingenuity and statistical
knowledge is required in the design and analysis to test the effect of different
types of manures, levels of irrigation and varieties of crops.

(v) In Meteorology - Weather forecasting depends on statistical
methods.

(vi) It is advantageous in Education, Anthropometry and higher sciences.


Q.26. Describe the statistical methods and specify their limitations. 


Ans. The statistical methods are devices by which complex and numerical
data are so systematically treated as to present a comprehensible and intelligible
view of them. In other words, the statistical method is a technique used ...
numerical data. The different steps that are included ... collection of data,
classification, tabulation ...

... the characteristics which cannot be measured ... cannot be studied
satisfactorily. Such characteristics are health, intelligence, honesty etc.

(iv) Statistical conclusions ... results might ... are quoted short of their
context. The argument that "in a ... vaccinated persons ... pox; therefore
vaccination is ..." is statistically defective, since we are not told what
percentage of the persons who were not vaccinated ... died.

(v) Statistical technique is the same for the social as for the physical
sciences, while both are different in nature.

(vi) Only one who has an expert knowledge of statistical methods can
handle the statistical data properly. The data placed in the hands of a
non-expert may lead to fallacious results.

1 


Q.27. Explain the term probabilistic analysis of data.


Ans. A widespread application of such an analysis is weather forecasting:
for more than a century, hundreds of weather stations around the world have recorded
various important parameters such as the air temperature, wind speed,
precipitation, snowfall etc. Based on these data, scientists build models
reflecting seasonal weather changes (depending on the time of the year) as
well as the global trends, for example, temperature change during the last 50
years. These models thought of by themselves, but not necessarily generate
good assessments. The very fact that there was correspondence about the
gambles, and occasionally some disputes about them, indicated that people
do not automatically assess probabilities in the same way, or accurately (e.g.,
corresponding to relative frequencies, or making good gambling choices).


The Von-Neumann and Morgenstern work, however, does involve some
psychological assumptions that people can engage in 'good' probabilistic
thinking. First, they must do so implicitly, so that the choice follows certain
'choice axioms' that allow the construction of an expected utility model, i.e.,
a model that represents choice as a maximization of implicit expected utility;
that in turn requires that probabilities at the very least follow the standard
axioms of probability theory.


It also implies considering conditional probabilities in a rational manner,
which is done only when implicit or explicit conditional probabilities are
consistent with Bayes' theorem. Thus, the Von-Neumann and Morgenstern
work required that people be 'Bayesian' in a consistency sense, although that
term is sometimes used to imply that probabilities should at base be interpreted
as degrees of belief.

Another way in which probability assessment must be 'good' is that there
should be some at least reasonable approximation between probabilities and
relative frequencies; in fact, under particular circumstances (of ...
indefinitely repeated observations), the probabilities of someone whose
belief is constrained by Bayes' theorem must approximate relative frequencies.

Consider an analogy with swimming. People do swim well, but occasionally
we drown. What happens is that there is a systematic
bias in attempting to swim that makes it difficult. We want to hold our heads
above water. When, however, we raise our heads to do so, we tend to ...
in the water, which is one of the few ways of drowning
(aside from freezing or exhaustion ...). In the same way, people
systematically deviate from the rules of probabilistic thinking. Again, however,
the emphasis is on 'systematic'.

For example, there is now evidence that people's probabilistic judgements
are 'sub additive', in that when a general class is broken into components, the
judgementally estimated probabilities assigned to disjoint components that
comprise the class sum to a larger number than the probability assigned to the
class. That is particularly true in memory, where, for example, people may
recall the frequency with which they were angry at a close friend or relative in
the last month and the frequency with which they were angry at a total stranger,
and the sum of the estimates is greater than an independent estimate of being
angry, period (even though it is possible to be angry at someone who is neither
a close friend or relative nor a total stranger). The clever opponent will then
bet against the occurrence of each component but on the occurrence of the
basic event, thereby creating a Dutch Book.


Q.28. Write short note on statistics and linear algebra for ML. 


Ans. Linear algebra is a valuable tool in other branches of mathematics,
especially statistics.

The impact of linear algebra is important to consider, given the
foundational relationship both fields have with the field of applied machine
learning. Some points of linear algebra on statistics and statistical methods are
as follows -

(i) Use of vector and matrix notation, especially with multivariate
statistics.

(ii) Solutions to least squares and weighted least squares, such as for
linear regression.

(iii) Estimates of mean and variance of data matrices.

(iv) The covariance matrix that plays a key role in multinomial
Gaussian distributions.

(v) Principal component analysis for data reduction that draws many
of these elements together.

As we can see, modern statistics and data analysis, at least as far as the
interests of a machine learning practitioner are concerned, depend on the
understanding and tools of linear algebra.


Q.29. Define some examples of linear algebra in machine learning.

Ans. Some examples of linear algebra in machine learning are as follows -

(i) Linear regression 

(ii) Regularization 

(iii) Principal component analysis (PCA) 

(iv) Singular-value decomposition (SVD) 

(v) Deep learning. 


(i) Linear Regression - It is an old method from statistics for
describing the relationships between variables. It is often used in machine
learning for predicting numerical values in simpler regression problems. There
are many ways to describe and solve the linear regression problem, i.e. finding
a set of coefficients that when multiplied by each of the input variables and
added together results in the best prediction of the output variable. If you have
used a machine learning tool or library, the most common way of solving
linear regression is via a least squares optimization that is solved using matrix
factorization methods from linear algebra, such as an LU decomposition or
a singular-value decomposition (SVD). Even the common way of
summarizing the linear regression equation uses linear algebra notation

$$x = B a$$

where x is the output variable, B is the dataset and a are the model coefficients.
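The least squares idea can be sketched with the notation x = Ba used above. This is a minimal illustration under assumptions: instead of an LU or SVD routine from a library, it forms the normal equations (BᵀB)a = Bᵀx and solves them by Gaussian elimination with partial pivoting, which is adequate for small, well-conditioned problems only. The data and the helper names `solve_linear_system` and `least_squares` are hypothetical.

```python
def solve_linear_system(m, v):
    """Solve m * sol = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - s) / aug[r][r]
    return sol


def least_squares(B, x):
    """Coefficients a minimizing ||B a - x||^2 via the normal equations."""
    cols = len(B[0])
    btb = [[sum(row[i] * row[j] for row in B) for j in range(cols)]
           for i in range(cols)]
    btx = [sum(row[i] * xi for row, xi in zip(B, x)) for i in range(cols)]
    return solve_linear_system(btb, btx)


# Each row of B is [1, input]; the leading 1 models the intercept.
B = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
x = [1.0, 3.0, 5.0, 7.0]  # generated from x = 1 + 2 * input
a = least_squares(B, x)
print("intercept and slope:", [round(c, 6) for c in a])
```

Libraries prefer factorizations such as QR or SVD over the normal equations because forming BᵀB can lose numerical accuracy; the sketch trades that robustness for brevity.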


(ii) Regularization - In applied machine learning, we often seek the
simplest possible models that achieve the best skill on our problem. Simpler
models are often better at generalizing from specific examples to unseen data.
In many methods that involve coefficients, such as regression methods and
artificial neural networks, simpler models are often characterized by models
that have smaller coefficient values. A technique that is often used to encourage
a model to minimize the size of coefficients while it is being fit on data is
called regularization. Common implementations include the L2 and L1 forms
of regularization. Both of these forms of regularization are in fact a measure
of the magnitude or length of the coefficients as a vector and are methods
lifted directly from linear algebra called the vector norm.
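The two vector norms can be written out directly. The sketch below is illustrative: the coefficient vector, the base loss value and the weight 0.1 are made-up numbers, the helper names are hypothetical, and squaring the L2 norm in the penalty is a common convention assumed here rather than something the text specifies.

```python
import math


def l1_norm(coeffs):
    """L1 norm: sum of absolute coefficient values."""
    return sum(abs(c) for c in coeffs)


def l2_norm(coeffs):
    """L2 norm: Euclidean length of the coefficient vector."""
    return math.sqrt(sum(c * c for c in coeffs))


def penalized_loss(base_loss, coeffs, lam, kind="l2"):
    """Add a regularization penalty, weighted by lam, to a fit loss."""
    if kind == "l1":
        penalty = l1_norm(coeffs)
    else:
        penalty = l2_norm(coeffs) ** 2  # squared L2 norm (assumed convention)
    return base_loss + lam * penalty


coeffs = [3.0, -4.0]
print("L1 norm:", l1_norm(coeffs))
print("L2 norm:", l2_norm(coeffs))
print("L2-penalized loss:", penalized_loss(1.0, coeffs, lam=0.1, kind="l2"))
print("L1-penalized loss:", penalized_loss(1.0, coeffs, lam=0.1, kind="l1"))
```

Larger coefficients raise the penalty, so minimizing the penalized loss pushes the fit towards smaller coefficient values.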


(iii) Principal Component Analysis (PCA) - Often a dataset has
many columns, perhaps tens, hundreds, thousands or more. Modeling data
with many features is challenging, and models built from data that include
irrelevant features are often less skillful than models trained from the most
relevant data. It is hard to know which features of the data are relevant and
which are not. Methods for automatically reducing the number of columns of
a dataset are called dimensionality reduction, and perhaps the most popular
method is called the principal component analysis, or PCA for short. This method
is used in machine learning to create projections of high-dimensional data for
both visualization and for training models. The core of the PCA method is a
matrix factorization method from linear algebra. The eigendecomposition can
be used and more robust implementations may use the singular-value
decomposition or SVD.

(iv) Singular-value Decomposition (SVD) - Another popular
dimensionality reduction method is the singular-value decomposition method,
or SVD for short. As mentioned and as the name of the method suggests, it is
a matrix factorization method from the field of linear algebra. It has wide use
in linear algebra and can be used directly in applications such as feature
selection, visualization, noise reduction and more.


(v) Deep Learning - Artificial neural networks are nonlinear machine
learning algorithms that are inspired by elements of the information processing
in the brain and have proven effective at a range of problems, not least predictive
modeling. Deep learning is the recent resurgence in the use of artificial neural networks
with newer methods and faster hardware that allow for the development and
training of larger and deeper (more layers) networks on very large datasets.
Deep learning methods routinely achieve state-of-the-art results on a range
of challenging problems such as machine translation, photo captioning, speech
recognition and much more.


At their core, the execution of neural networks involves linear algebra
data structures multiplied and added together. Scaled up to multiple dimensions,
deep learning methods work with vectors, matrices and even tensors of inputs
and coefficients, where a tensor is a matrix with more than two dimensions.
Linear algebra is central to the description of deep learning methods via matrix
notation to the implementation of deep learning methods such as Google's
TensorFlow Python library that has the word "tensor" in its name.
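The phrase "multiplied and added together" can be made concrete with a single network layer. The sketch below is a hedged illustration in plain Python, not TensorFlow code: the weights, bias and inputs are made-up numbers, the function name `dense_layer` is hypothetical, and the ReLU activation is an assumed choice.

```python
def dense_layer(weights, bias, inputs):
    """One fully connected layer: ReLU(weights * inputs + bias)."""
    outputs = []
    for row, b in zip(weights, bias):
        # Matrix-vector product: multiply element-wise, then add up.
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(max(0.0, z))  # ReLU keeps positive values only
    return outputs


W = [[1.0, -1.0],
     [0.5, 0.5]]        # 2 x 2 weight matrix (illustrative values)
b = [0.0, 1.0]          # bias vector
h = dense_layer(W, b, [2.0, 3.0])
print("layer output:", h)
```

Stacking several such layers, each with its own weight matrix, gives the deeper networks described above; libraries perform the same products on whole batches of inputs at once.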


Q.30. Define some convex optimization problems for machine learning.


Ans. Some convex optimization problems for machine learning are as
follows -

$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{m} f_i(x) + \lambda R(x)$$

where the functions $f_1, \ldots, f_m, R$ are convex and $\lambda \geq 0$ is a fixed parameter.
The interpretation is that $f_i(x)$ represents the cost of using x on the i-th element
of some data set, and R(x) is a regularization term which enforces some
"simplicity" in x. We discuss now major instances of this equation. In all cases
one has a data set of the form $(w_i, y_i) \in \mathbb{R}^n \times \mathcal{Y}$, $i = 1, 2, 3, \ldots, m$, and the cost
function $f_i$ depends only on the pair $(w_i, y_i)$.

In classification one has $\mathcal{Y} = \{-1, 1\}$. Taking $f_i(x) = \max(0, 1 - y_i x^{\top} w_i)$
(the so-called hinge loss) and $R(x) = \|x\|_2^2$ one obtains the SVM problem. On
the other hand taking $f_i(x) = \log(1 + \exp(-y_i x^{\top} w_i))$ (the logistic loss) and
again $R(x) = \|x\|_2^2$ one obtains the logistic regression problem.

In regression one has $\mathcal{Y} = \mathbb{R}$. Taking $f_i(x) = (x^{\top} w_i - y_i)^2$ and R(x) = 0 one
obtains the vanilla least-squares problem which can be rewritten in vector
notation as -

$$\min_{x \in \mathbb{R}^n} \; \|Wx - Y\|_2^2$$

where $W \in \mathbb{R}^{m \times n}$ is the matrix with $w_i^{\top}$ on the i-th row and $Y = (y_1, \ldots, y_m)^{\top}$.

With $R(x) = \|x\|_2^2$ one obtains the ridge regression problem, while with
$R(x) = \|x\|_1$ this is the LASSO problem.
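The objective and its instances can be evaluated directly. The following sketch only computes objective values for a given x; it does not solve the optimization problems. The tiny data set, the candidate x, the weight 0.1 and all helper names are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(x, w, y):
    """max(0, 1 - y * x^T w), used in the SVM problem."""
    return max(0.0, 1.0 - y * dot(x, w))


def logistic_loss(x, w, y):
    """log(1 + exp(-y * x^T w)), used in logistic regression."""
    return math.log(1.0 + math.exp(-y * dot(x, w)))


def squared_loss(x, w, y):
    """(x^T w - y)^2, used in least squares, ridge and LASSO."""
    return (dot(x, w) - y) ** 2


def l2_squared(x):
    return sum(c * c for c in x)


def l1(x):
    return sum(abs(c) for c in x)


def objective(x, data, loss, reg, lam):
    """Sum of per-example costs plus lam times the regularizer."""
    return sum(loss(x, w, y) for w, y in data) + lam * reg(x)


# Two labelled points (w_i, y_i) with labels in {-1, 1}.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
x = [2.0, -0.5]

svm_value = objective(x, data, hinge_loss, l2_squared, lam=0.1)
logreg_value = objective(x, data, logistic_loss, l2_squared, lam=0.1)
ridge_value = objective(x, data, squared_loss, l2_squared, lam=0.1)
lasso_value = objective(x, data, squared_loss, l1, lam=0.1)
print(f"SVM objective: {svm_value:.3f}")
print(f"Logistic regression objective: {logreg_value:.3f}")
print(f"Ridge objective: {ridge_value:.3f}, LASSO objective: {lasso_value:.3f}")
```

Only the loss and the regularizer change between the four problems; the overall form of the objective stays the same, which is the point of the general formulation.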


In our last example the design variable x is best viewed as a matrix, and
thus we denote it by a capital letter X. Here our data set consists of observations
of some of the entries of an unknown matrix Y, and we want to "complete"
the unobserved entries of Y in such a way that the resulting matrix is "simple"
(in the sense that it has low rank). After some massaging the matrix completion
problem can be formulated as follows -

$$\min \; \mathrm{Tr}(X) \quad \text{s.t. } X \in \mathbb{R}^{n \times n},\; X^{\top} = X,\; X \succeq 0,\; X_{i,j} = Y_{i,j} \text{ for } (i,j) \in \Omega$$

where $\Omega \subset [n]^2$ and $(Y_{i,j})_{(i,j) \in \Omega}$ are given.


Q.31. What do you understand by data visualization ? Discuss some of
Python's data visualization tools such as box plots, pie charts and bar charts
in brief.


Ans. Data visualization helps top management, who are the decision
makers, to view analytics represented visually, so that they can easily
understand complex ideas and identify new structures or patterns. When
visualization becomes interactive, we are able to push the concept a little
further, using technological tools to grasp more details from graphs
and charts, thereby making changes to the data that is being seen and how
the data is being processed.

It also means putting data forward and representing them in a particular
methodical layout which contains some variables and attributes for bringing
out information. Visualization-based data discovery techniques give room
to business owners to mash up sources of completely different data so as to
create custom analytical views.

(i) Box Plots - A box plot is a graphic representation of quantitative
variables based on their quartiles, as well as their smallest and largest values.
The simplest box plot contains a box and two whiskers. The bottom and top of
the box are the first and third quartiles, and the line inside the box is the median.
The box area represents the range between the first and third quartiles, which
is called the interquartile range (IQR). The ends of the two whiskers indicate
the maximum and minimum values of the variable. Box plots can have many
variations. For example, complex box plots mark outliers (three or more times
the IQR above the third quartile or below the first quartile) and suspected
outliers (one and a half or more times the IQR above the third quartile or
below the first quartile). In a complex box plot, if either type of outlier appears,
the end of the whisker on the appropriate side changes to one and a half IQR
from the corresponding quartile. The end of this whisker is defined as an inner
fence, and the third IQR from this quartile is considered as the outer fence.
Outliers in this plot are displayed as filled circles and suspected outliers are
displayed as unfilled circles. Figs. 1.4 and 1.5 show an example of simple and
complex box plots.
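The fence rules described above can be sketched with the standard library. This is an illustration under assumptions: quartiles are computed with `statistics.quantiles` using its "inclusive" method (one of several quartile conventions, so other tools may give slightly different quartiles), the sample values are illustrative, and the function name `classify_points` is hypothetical.

```python
import statistics


def classify_points(values):
    """Return quartiles, IQR, outliers and suspected outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # outer fences

    # Outliers: three or more IQRs beyond the quartiles.
    outliers = [v for v in values if v <= outer_low or v >= outer_high]
    # Suspected outliers: between the inner and outer fences.
    suspected = [v for v in values
                 if v not in outliers and (v <= inner_low or v >= inner_high)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "outliers": outliers, "suspected": suspected}


values = [1, 2, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 20]
summary = classify_points(values)
print(summary)
```

The returned quartiles are the numbers a box plot draws as the box and its middle line, and the two lists correspond to the filled and unfilled circles of a complex box plot.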


[Fig. 1.4: Simple Box Plot, marking the maximum, third quartile, IQR, first quartile and minimum.]

[Fig. 1.5: Complex Box Plot, marking outliers and suspected outliers.]
We can see that it's super easy to create this plot with Matplotlib. All we
need is the function plt.boxplot( ). The first argument is the data points.
The simple box plot using Python's Matplotlib is -
import matplotlib.pyplot as plt
# The data points to summarize
values = [1, 2, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 20]
plt.boxplot(values)
plt.yticks(range(1, 21))
plt.ylabel("Value")
plt.show()

(ii) Pie Charts - It is as well known as a circle graph. A pie chart shows
information in a way that is not difficult to read, in "pie-slice" form, and the
various sizes of slice show how much of an element is in existence. When the
slice is big, then it shows that more of the data was gathered. It is also used to
compare values of data, and the moment some values are represented on a pie
chart, then we will be able to view which of the items is the least popular or
which is more popular. The best and effective way to make use of a pie chart is
when it contains a few components and when the percentages and texts are also
involved in order to define the content. By providing additional information,
report consumers do not have to guess the meaning and value of each slice. If
you choose to use a pie chart, the slices should be a percentage of the whole.

[Figure: Standard Pie Chart]

[Figure: Donut Chart Demonstration; legend: 1. Labor, 2. Licenses, 3. Taxes, 4. Legal, 5. Insurance, 6. Facilities, 7. Production.]

[Legend of the exploding pie chart: Apple, Blue Berry, Key Lime, Pumpkin, Pecan.]


Fig. 1.8 A Simple Exploding Pie Chart

A wedge is used to represent a data part that has the same characteristics,
and the pie chart control usually decides the data wedge size when its value is
compared with the other data wedges. Pie charts consist of two
variations called Doughnut chart and Exploding pie chart. The Doughnut charts
are almost same as the standard pie chart just that they consist of a hollow centre,
and in the exploding charts, the wedges are separated from the other wedges.
To make a pie chart with Matplotlib, we can use the plt.pie( ) function. The autopct
parameter allows us to display the percentage value using the Python string
formatting.
import matplotlib.pyplot as plt
sizes = [25, 20, 45, 10]
labels = ["Cats", "Dogs", "Tigers", "Goats"]
plt.pie(sizes, labels=labels, autopct="%.2f")
# Keep the pie circular on the current axes
plt.gca().set_aspect("equal")
plt.show()


(iii) Bar Chart - Bar chart is also referred to as column chart, and
it is used for comparison of items of different groups. The bars are used
to represent the various values of a group, and the bar chart makes use of both
horizontal bars and vertical bars. When the values to be represented are clearly
different and such differences in the bars can be seen by the human eye, then we
can decide to make use of a bar chart, but when there is a very huge number
of values to be displayed, then it might be a bit harder to make comparisons
between the bars. Most times, bar chart is used to represent discrete data and
is as well used to present single data series, while the data points that are related
are often grouped in a series.


[Figure: bar chart titled "The Cookie Shop, 2003-2005 Income" with one bar for each of the years 2003, 2004 and 2005]
Fig. 1.9 Displaying a Simple Bar Chart
To draw a bar chart with Matplotlib, we will need the plt.bar( ) function.
import matplotlib.pyplot as plt

# Our data
labels = ["2003", "2004", "2005"]
usage = [...]  # one value per label

# Generating the y positions. Later, we will use them to replace them with labels.
y_positions = range(len(labels))

# Creating our bar plot
plt.bar(y_positions, usage)
plt.xticks(y_positions, labels)
plt.ylabel("Usage (%)")
plt.title("The cookie shop")
plt.show()

Q.32. Explain data visualization techniques.

Ans. Visualization is the use of computer-supported, visual representations of data. Unlike static data visualization, interactive data visualization allows users to specify the format used in displaying data. Common visualization techniques are as shown in fig. 1.10.


Line Graph 


Fig. 1.10 Commonly used Data Visualization Techniques 


(i) Line Graph — This shows the relationship between items. It can be used to compare changes over a period of time.
(ii) Bar Chart — This is used to compare quantities of different categories.
(iii) Scatter Plot — This is a two-dimensional plot showing variation of two items.
(iv) Pie Chart — This is used to compare the parts of a whole.
Thus, the format of graphs and charts can take the form of a bar chart, pie chart, line graph, etc. It is important to understand which chart or graph to use for your data.
Data visualization uses computer graphics to show patterns, trends and relationships among elements of the data. It can generate pie charts, bar charts, scatter plots, and other types of data graph with simple pull-down menus and mouse clicks. Colors are carefully selected for certain types of visualization. When color is used to represent data, we must choose effective colors to differentiate between data elements.
In data visualization, data is abstracted and summarized. Spatial variables such as position, size, and shape represent key elements in the data. A visualization system should perform a data reduction, and transform and project the original dataset on a screen. It should visualize results in the form of charts and graphs and present results in a user-friendly way.
Q.33. Discuss the application of data visualization.
Ans. Most visualization designs are to aid decision making and serve as tools that augment cognition. In designing and building a data visualization prototype, one must be guided by how the visualization will be applied. Data visualization is more than just representing numbers; it involves selecting and rethinking the numbers on which the visualization is based.
Visualization of data is an important branch of computer science and has a wide range of application areas. Several application-specific tools have been developed to analyze individual datasets in many fields of medicine and science.

(i) Public Health — The ability to analyze and present data in an understandable manner is critical to the success of public health surveillance. Health researchers need useful and intelligent tools to aid their work. Security is important in cloud-based medical data visualizations. Open any medical or health magazine today, and we will see all kinds of graphical representations.

(ii) Renewable Energy — Calculation of energy consumption compared to production is important for an optimum solution.

(iii) Environmental Science — As environmental managers are required to make decisions based on highly complex data, they require visualization. Visualization applications within applied environmental research are beginning to emerge. It is desirable to have at one's disposal programs for displaying results.

(iv) Fraud Detection — Data visualization is important in the early stages of fraud investigation. Fraud investigators may use data visualization as a proactive detection approach, using it to see patterns that suggest fraudulent activity.

(v) Library Decision Making — Data visualization software allows librarians the flexibility to better manage and present information collected from different sources. It gives them the skill to present information in a creative and compelling way. Visualization of library data highlights purchasing decisions, library needs and goals. Librarians, as de facto experts of data visualization, can assist students, faculty and researchers in visualizing their data.

Several information visualization algorithms and associated software have been developed. This software enables users to interpret data more rapidly than ever before.
Q.34. Write short notes on the following —
(i) Histogram
(ii) Quantile plots
(iii) q-q plots
(iv) Scatter plot
(v) Loess curve.

Ans. (i) Histogram — Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values at the bucket. If A is categoric, such as automobile_model or item_type, then one rectangle is drawn for each known value of A, and the resulting graph is more commonly referred to as a bar chart. If A is numeric, the term histogram is preferred. In an equal-width histogram, each bucket represents an equal-width range of numerical attribute A.

Fig. 1.11 shows a histogram for the data set of table 1.6, where buckets are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold.
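
The bucketing behind an equal-width histogram can be sketched in a few lines of Python. This is only an illustration and not the procedure used to draw fig. 1.11: the function name equal_width_histogram, the bucket boundaries and the (unit price, count) records are assumptions made for the example.

```python
def equal_width_histogram(records, start, width, n_buckets):
    """Sum item counts into equal-width price buckets.

    records   : iterable of (unit_price, count_sold) pairs (hypothetical data)
    start     : lower edge of the first bucket
    width     : width of every bucket, e.g. 20 for $20 increments
    n_buckets : number of buckets
    """
    # One entry per bucket, keyed by its inclusive (low, high) price range.
    buckets = {}
    for b in range(n_buckets):
        low = start + b * width
        buckets[(low, low + width - 1)] = 0
    # Add each record's count to the bucket that contains its price.
    for price, count in records:
        index = (price - start) // width
        if 0 <= index < n_buckets:
            low = start + index * width
            buckets[(low, low + width - 1)] += count
    return buckets


# Hypothetical (unit price, count of items sold) records
records = [(40, 275), (43, 300), (47, 250), (62, 100), (85, 40), (119, 10)]
hist = equal_width_histogram(records, start=40, width=20, n_buckets=5)
for (low, high), total in hist.items():
    print(f"{low}-{high}: {total}")
```

Each rectangle of the histogram would then be drawn with a height equal to the total stored for its bucket.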

[Figure: histogram with Unit Price ($) on the horizontal axis in $20 buckets (60-79, 80-99, 100-119, 120-139 are legible) and Count of Items Sold on the vertical axis]


Fig. 1.11 A Histogram for the Data Set of Table 1.6

Table 1.6 A Set of Unit Price Data for Items Sold at a Branch of All Electronics

Unit Price ($)    Count of Items Sold
40                275
43                300
47                250
…                 …
…                 360
…                 515
…                 540
…                 …
…                 …
…                 270
…                 350

(ii) Quantile Plots — A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. The mechanism used in this step is slightly different from the percentile computation.
Let $x_i$, for $i = 1$ to $N$, be the data sorted in increasing order so that $x_1$ is the smallest observation and $x_N$ is the largest. Each observation, $x_i$, is paired with a percentage, $f_i$, which indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$. We say "approximately" because there may not be a value with exactly a fraction, $f_i$, of the data below or equal to $x_i$. Fig. 1.12 shows a quantile plot for the unit price data of table 1.6.

[Figure: quantile plot with f-value (0.000 to 1.000) on the horizontal axis and Unit Price ($) on the vertical axis]
Fig. 1.12 A Quantile Plot for the Unit Price Data of Table 1.6

(iii) q-q Plot — A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.
Fig. 1.13 shows a quantile-quantile plot for unit price data of items sold at two different branches during a given time period. Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. To aid in comparison, we also show a straight line that represents the case of when, for each given quantile, the unit price at each branch is the same. In addition, the darker points correspond to the data for $Q_1$, the median, and $Q_3$, respectively.

[Figure: q-q plot of the unit prices at branch 1 against the unit prices at branch 2]
Fig. 1.13 A Quantile-quantile Plot for Unit Price Data from Two Different Branches

(iv) Scatter Plot — A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane. Fig. 1.14 shows a scatter plot for the set of data in table 1.6. The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships.

[Figure: scatter plot with Unit Price ($) on the horizontal axis and Items Sold on the vertical axis]
Fig. 1.14 A Scatter Plot for the Data Set of Table 1.6

When dealing with several attributes, a scatter-plot matrix is a useful extension to the scatter plot. Given n attributes, a scatter-plot matrix is an n × n grid of scatter plots that provides a visualization of each attribute with every other attribute.
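
The pairing of observations with f-values for a quantile plot, and the pairing of quantiles for a q-q plot, can be sketched as follows. The text above does not give a formula for $f_i$; the common convention $f_i = (i - 0.5)/N$ is assumed here. The function names and the two small samples are hypothetical, and the q-q helper only covers the simple case of two samples of equal size.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, assuming f_i = (i - 0.5) / N for sorted data."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


def qq_plot_points(sample_a, sample_b):
    """Pair the sorted values of two equally sized samples, quantile by quantile."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical unit prices at two branches
branch_1 = [47, 40, 75, 43, 74]
branch_2 = [45, 42, 52, 80, 78]

points = quantile_plot_points(branch_1)   # x-axis: f-value, y-axis: unit price
pairs = qq_plot_points(branch_1, branch_2)  # x-axis: branch 1, y-axis: branch 2
```

Points of the q-q plot that lie above the line y = x would indicate quantiles at which branch 2 has the higher unit price.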
(v) Loess Curve — A loess curve is an important exploratory graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence. The word loess is short for 'local regression'. Fig. 1.15 shows a loess curve for the set of data in table 1.6.

[Figure: scatter plot of Items Sold against Unit Price ($) with a fitted loess curve]
Fig. 1.15 A Loess Curve

Two parameters are needed to fit a loess curve: α, a smoothing parameter, and λ, the degree of the polynomials. While α can be any positive number, λ can be 1 or 2. The goal in choosing α is to produce a fit that is as smooth as possible without unduly distorting the underlying pattern in the data. The curve becomes smoother as α increases. There may be some lack of fit, however, indicating possible "missing" data patterns. If α is very small, the underlying pattern is tracked, yet overfitting of the data may occur where local "wiggles" in the curve may not be supported by the data. If the underlying pattern of the data has a 'gentle' curvature with no local maxima and minima, then local linear fitting is usually sufficient (λ = 1). However, if there are local maxima or minima, then local quadratic fitting (λ = 2) typically does a better job of following the pattern of the data and maintaining local smoothness.
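
A simplified local linear fit (λ = 1) can be sketched as below. This is not a full loess implementation: it assumes that α is interpreted as the fraction of points used in each local fit, it uses tricube weights (a common choice that the text above does not specify), and it omits the robustness iterations and the quadratic case λ = 2. The function names and data are hypothetical.

```python
import math


def tricube(u):
    """Tricube weight: 1 at the target point, falling to 0 at distance >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, alpha=0.5):
    """Local linear (lambda = 1) estimate of y at x0.

    alpha is treated as the fraction of points used in each local fit,
    which is an assumption of this sketch.
    """
    n = len(xs)
    k = min(n, max(3, math.ceil(alpha * n)))  # neighbours per local fit
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth slightly beyond the k-th nearest point so it keeps a positive weight.
    h = dists[k - 1] * 1.0001 or 1.0
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    # Weighted means, then the weighted least-squares slope around them.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:  # all weight sits on a single x value
        return my
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


def loess_curve(xs, ys, alpha=0.5):
    """Evaluate the local fit at every observed x, in increasing order of x."""
    return [(x, loess_point(x, xs, ys, alpha)) for x in sorted(xs)]


# Hypothetical data: a straight line, which a local linear fit reproduces.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
curve = loess_curve(xs, ys, alpha=0.5)
```

Increasing alpha makes every local fit use more neighbours, which corresponds to the smoother curve described above.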
Q.35. Write short notes on the following —
(i) Table chart (ii) Bubble chart
(iii) Tree map (iv) Parallel coordinate
(v) Line chart (vi) Area chart.

Ans. (i) Table Chart — A table is simply the arrangement of data in rows and columns. When conducting research and analysis of data, the role of tables is very important. Tables are simple to understand and analyze, and it is easy to interpret the method of data representation. A row is a representation of variables, and a column is likewise a representation of records that have a set of values. An example is shown in table 1.7.

Table 1.7


Homeowners 


Renters 


; | Percent 
a, re 
Expenditu 2010 | es 
$33,460] 9 
val expenditures 4802| -2 
Annual © k = 
aa at home 1,900 13 
Food away from home 12.843 16 
i i 1,344] -27 
pps and ae a 5.046 
Transportation r LSI 
Gasoline and motor oil 1,518 
Healthcare 909 
Health insurance 


1,390 
2,907 


Entertainment an 
Personal insurance and pensions 


(ii) Bubble Chart — A bubble plot is a variation of a scatter plot in which the markers are substituted with bubbles. This is possible only when we have a set of data points with three values contained in each data item. It shows the relationship that exists between a minimum of three variables. Two of them are represented by the plot axes, i.e., the x-axis and y-axis, while the third is represented by the bubble size, and each bubble is a representation of an observation.

A bubble plot is used with a lot of values, say hundreds of them, or when the values differ by several orders of magnitude. Colors can be used to represent an additional measure, and the bubbles can be animated in order to show data changes over a period of time.

[Figure: bubble plot titled "Annual Sales Chart" with Original Cost (Bt./kg) and Price (Bt./kg) on the axes]
Fig. 1.16 Simple Bubble Plot

The bubble plot is also very useful in project management for comparing the rate of risk and success involved in executing a project. Where there are three values, such as the net present value and the probability of success, the third value is represented by the bubble size.

(iii) Tree Map — A tree map is a visualization technique that has the attribute of showing data in a hierarchy in nested or layered rectangles. It is a very effective technique for visualizing hierarchical structures. Users are able to compare nodes and sub-nodes at different depths, and are likewise able to identify expected results and patterns. A lot of data sets have hierarchical characteristics, and the objects are thereby divided into divisions, sub-divisions, etc.

[Figure: hierarchy of the Immune System, divided into Non Specific Defenses (First Line Defense: Skin, Mucus, Tears, Saliva, Hair and Cilia, Stomach Acid; Second Line Defense: Inflammatory Response, Interferons, Fever, White Blood Cells (Phagocytes), Histamines) and Specific Defenses (Acquired Immunity: Humoral Immunity with B Cells, Cell Mediated Immunity with T Cells)]

Fig. 1.17 Tree Map Displaying Hierarchical Data


(iv) Parallel Coordinates — The parallel coordinate technique makes use of the concept of mapping a multi-dimensional point onto a number of axes, all of which are parallel to each other. In this technique, single data elements are plotted across many dimensions. Each dimension is connected to a y-axis, and each object of the data is shown along the axes as a set of connected points. The parallel coordinate technique is important if you want to show multidimensional data, and a lot of dimensions can be organized and expanded by this technique.

A single polygonal line is formed for each of the occurrences represented; it connects the individual coordinate mappings. Therefore the number of dimensions that can be represented is not limited at all. This visualization technique is applicable in areas such as computer vision, air traffic control, computational geometry, robotics and data mining.

A good advantage of this visualization technique is that it represents dimensions without limits. However, you can encounter cases in which the lines overlap, which causes difficulty in identifying patterns; this is caused when a large number of points is represented.

(v) Line Chart — The line chart is a common, well-known graph in many fields and is also known as a line graph. It is a graph which is used to display information as a series of data points. These points are connected through continuous or straight lines. The line graph is an extension of the scatter plot. Data points can be represented by icons or symbols, or a simple line can be drawn without icons. The line chart is used to visualize a trend in data over a time interval, that is, it is used to show the tendency in the data set and illustrate the data behaviour with the passage of time over a specific interval of time.

[Figure: two line charts for Item A and Item B over the years 2001 to 2005, one drawn with plain lines and one with symbols at the data points]
Fig. 1.18 Simple Line Chart
Fig. 1.19 Line Chart with Symbolic Data Point

There are many forms or variations of the line chart or line graph, depending on the data points to be plotted, for example — step line chart, reverse step line chart, vertical segment line chart, horizontal segment line chart, curve line chart.

(vi) Area Chart — The area chart, also called an area graph, is used to display quantitative data graphically. The area chart control represents data as a bounded area. The bounded area is based on the line graph: the line is generated and the area below it is shaded with colors, different textures and hatching, which produces the graph shown in figs. 1.20 and 1.21.

[Figure: line graph titled "Inflation", showing Inflation Rate for the years 2001 to 2008]
Fig. 1.20 Line Graph

[Figure: area chart titled "Inflation" for the years 2001 to 2008]
Fig. 1.21 Area Chart

There are many forms of area chart, depending on the data points to be plotted, for example — step area chart and curve area chart, while the line area chart is the default area chart, as shown in fig. 1.22.

[Figure: area chart with two series, Inflation Food and Inflation Overall, for the years 2001 to 2008]
Fig. 1.22 Multi Series Area Chart

Prob.3. A lot has 10% defective items. Ten items are chosen randomly from this lot. Find the probability that exactly 2 of the chosen items are defective. (R.G.P.V., Dec. 2016)

Sol. Here $p = \frac{1}{10}$, $q = 1 - p = \frac{9}{10}$, $n = 10$, $r = 2$.

$$P(2) = {}^{10}C_2 \left(\frac{1}{10}\right)^{2} \left(\frac{9}{10}\right)^{8} = 45 \times (0.01)(0.4305) = 0.1937 \quad \text{Ans.}$$

Prob.4. A perfect cubical die is thrown a large number of times in sets of 8. The occurrence of 5 or 6 is called a success. In what proportion of the sets would you expect 3 successes?

Sol. Here, $n = 8$, $p = \frac{2}{6} = \frac{1}{3}$, $q = \frac{2}{3}$.

Thus the binomial distribution is given by $N\left(\frac{2}{3} + \frac{1}{3}\right)^{8}$.

The number of sets in which 3 successes are expected

$$= N \times {}^{8}C_3 \left(\frac{1}{3}\right)^{3} \left(\frac{2}{3}\right)^{5} = N \times \frac{56 \times 32}{27 \times 243}$$

$$\text{Percentage} = N \times \frac{56 \times 32}{27 \times 243} \times \frac{100}{N} = \frac{179200}{6561} = 27.31\% \quad \text{Ans.}$$

Prob. Six dice are thrown 729 times. How many times do you expect at least three dice to show a five or six?

Sol. Here, $p = \frac{2}{6} = \frac{1}{3}$, $q = \frac{2}{3}$, $n = 6$, $N = 729$.

The expected number of times at least three dice show a five or six

$$= N\left[1 - \left({}^{6}C_0\, q^{6} + {}^{6}C_1\, p\, q^{5} + {}^{6}C_2\, p^{2} q^{4}\right)\right] = 729\left[1 - \frac{64 + 192 + 240}{729}\right] = 729 - 496 = 233 \quad \text{Ans.}$$

Prob.5. The probability that a bomb dropped from a plane will strike the target is 1/5. If six bombs are dropped, find the probability that
(i) exactly two will strike the target
(ii) at least two will strike the target. (R.G.P.V., June 2013)

Sol. Given, the probability that a bomb will strike the target is $p = \frac{1}{5}$, so that $q = 1 - \frac{1}{5} = \frac{4}{5}$.

(i) Therefore the probability that exactly two will strike the target out of 6 bombs is

$$= {}^{6}C_2\, p^{2} q^{4} = \frac{6 \times 5}{2 \times 1} \left(\frac{1}{5}\right)^{2} \left(\frac{4}{5}\right)^{4} = \frac{7680}{31250} = 0.246 \quad \text{Ans.}$$

(ii) The probability that at least two will strike the target out of 6 bombs

= 1 − (Probability that either none or one will strike the target)

$$= 1 - \left[{}^{6}C_0 \left(\frac{4}{5}\right)^{6} + {}^{6}C_1 \left(\frac{1}{5}\right) \left(\frac{4}{5}\right)^{5}\right] = 1 - \left(\frac{4}{5}\right)^{5} \left[\frac{4}{5} + \frac{6}{5}\right] = 1 - \frac{1024}{3125} \times \frac{10}{5}$$

$$= 1 - \frac{2048}{3125} = \frac{1077}{3125} = 0.345 \quad \text{Ans.}$$


Prob.6. The probability that an evening college student will graduate is 0.4. Determine the probability that out of 5 students —
(i) None (ii) One and (iii) At least one will graduate.
(R.G.P.V., Dec. 2015, 20…)

Sol. Given, the probability that an evening college student will graduate is $p = 0.4$.
The probability that an evening college student will not graduate is $q = 1 - p = 0.6$.

(i) The probability that none will graduate out of 5 students

$$= {}^{5}C_0\, p^{0} q^{5} = {}^{5}C_0 \times (0.4)^{0} (0.6)^{5} = 1 \times 1 \times 0.07776 = 0.078$$

(ii) The probability that exactly one student will graduate out of 5 students

$$= {}^{5}C_1 (0.4)^{1} (0.6)^{4} = 5 \times 0.4 \times 0.1296 = 0.2592$$

(iii) The probability that at least one will graduate out of 5 students

= 1 − (Probability that none will graduate)
= 1 − 0.078 = 0.922 Ans.


Prob.7. If 10% of bolts produced by a machine are defective, determine the probability that out of 12 bolts, chosen at random (i) one (ii) none (iii) at most two bolts will be defective.

Sol. Probability of a defective bolt, $p = 10\% = 0.1$
Probability of a non-defective bolt, $q = 1 - p = 1 - 0.1 = 0.9$
Total number of bolts, $n = 12$

(i) Probability of one defective bolt

$$P(1) = {}^{12}C_1 (0.1)^{1} (0.9)^{11} = 12 \times 0.1 \times 0.3138 = 0.3766 \quad \text{Ans.}$$

(ii) Probability that none is defective = Probability of 0 defective bolts

$$P(0) = {}^{12}C_0 (0.1)^{0} (0.9)^{12} = 1 \times 1 \times 0.2824 = 0.2824 \quad \text{Ans.}$$

(iii) Probability of two defective bolts

$$P(2) = {}^{12}C_2 (0.1)^{2} (0.9)^{10} = \frac{12 \times 11}{1 \times 2} \times 0.01 \times 0.3487 = 0.2301$$

Probability of at most two defective bolts
= P(0 or 1 or 2)
= P(0) + P(1) + P(2)
= 0.2824 + 0.3766 + 0.2301 = 0.8891 Ans.


Prob.8. An irregular six faced die is thrown and the expectation that in 10 throws it will give five even numbers is twice the expectation that it will give four even numbers. How many times in 10,000 sets of 10 throws would you expect it to give no even number?

Sol. Let $p$ be the probability of getting an even number and $q = 1 - p$.
The probability of 5 even numbers in 10 throws $= {}^{10}C_5\, p^{5} q^{5}$
The probability of 4 even numbers in 10 throws $= {}^{10}C_4\, p^{4} q^{6}$
Now according to the given condition

$${}^{10}C_5\, p^{5} q^{5} = 2 \times {}^{10}C_4\, p^{4} q^{6}$$

$$\Rightarrow \frac{10 \times 9 \times 8 \times 7 \times 6}{5 \times 4 \times 3 \times 2 \times 1}\, p^{5} q^{5} = 2 \times \frac{10 \times 9 \times 8 \times 7}{4 \times 3 \times 2 \times 1}\, p^{4} q^{6}$$

$$\Rightarrow \frac{6}{5}\, p = 2q \Rightarrow 3(1 - q) = 5q \Rightarrow 3 - 3q = 5q \Rightarrow 8q = 3 \ \text{or} \ q = \frac{3}{8}$$

Hence the number of times in 10,000 sets of 10 throws where we get no even number

$$= 10{,}000 \left(\frac{3}{8}\right)^{10} = 0.55 \quad \text{Ans.}$$

Prob.9. If 10% of bolts produced by a machine are defective, determine the probability that out of 10 bolts, chosen at random (i) one (ii) none (iii) at most 2 bolts will be defective.

Sol. Here the probability of a bolt being defective is $p = \frac{10}{100} = 0.1$

$q = 1 - p = 1 - 0.1 = 0.9$

(i) The probability of one defective bolt out of 10
$= {}^{10}C_1 (0.1)^{1} (0.9)^{9} = 0.3874$
(ii) The probability that none is defective
$= {}^{10}C_0 (0.1)^{0} (0.9)^{10} = (0.9)^{10} = 0.3487$
(iii) Probability of 2 defective $= {}^{10}C_2 (0.1)^{2} (0.9)^{8} = 0.1937$
Probability of at most 2 defective
= Probability of none defective
+ Probability of one defective
+ Probability of two defective
= 0.3487 + 0.3874 + 0.1937 = 0.9298 Ans.
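
The binomial probabilities in these problems can be checked numerically. The sketch below recomputes the figures of Prob.9 with Python's standard math.comb function; the helper names binomial_pmf and at_most are hypothetical and were chosen for this illustration only.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def at_most(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binomial_pmf(r, n, p) for r in range(k + 1))


# Prob.9: 10% defective bolts, sample of 10 bolts
p_one = binomial_pmf(1, 10, 0.1)
p_none = binomial_pmf(0, 10, 0.1)
p_two = binomial_pmf(2, 10, 0.1)
p_at_most_two = at_most(2, 10, 0.1)

print(round(p_one, 4))          # 0.3874
print(round(p_none, 4))         # 0.3487
print(round(p_two, 4))          # 0.1937
print(round(p_at_most_two, 4))  # 0.9298
```

The same two helpers apply to the other problems by changing n and p, for example binomial_pmf(2, 6, 1/5) for part (i) of the bomb problem.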


Prob.10. If in a lot of 500 solenoids 25 are defective, find the probability of 0, 1, 2, 3 defective solenoids in a random sample of 20 solenoids.

Sol. Here, $p = \frac{25}{500} = \frac{1}{20}$, $q = 1 - \frac{1}{20} = \frac{19}{20}$, $n = 20$.

(i) The probability that none is defective out of 20

$$= {}^{20}C_0 \left(\frac{1}{20}\right)^{0} \left(\frac{19}{20}\right)^{20} = 0.3585 \quad \text{Ans.}$$

(ii) The probability that one is defective out of 20

$$= {}^{20}C_1 \left(\frac{1}{20}\right)^{1} \left(\frac{19}{20}\right)^{19} = 20 \times \frac{1}{20} \left(\frac{19}{20}\right)^{19} = 0.3774 \quad \text{Ans.}$$

(iii) The probability that two are defective out of 20

$$= {}^{20}C_2 \left(\frac{1}{20}\right)^{2} \left(\frac{19}{20}\right)^{18} = \frac{190}{400} \left(\frac{19}{20}\right)^{18} = 0.1887 \quad \text{Ans.}$$

(iv) The probability that three are defective out of 20

$$= {}^{20}C_3 \left(\frac{1}{20}\right)^{3} \left(\frac{19}{20}\right)^{17} = \frac{20 \times 19 \times 18}{3 \times 2 \times 1} \times \frac{1}{8000} \left(\frac{19}{20}\right)^{17} = 0.0596 \quad \text{Ans.}$$

HYPOTHESIS FUNCTION AND TESTING, DATA DISTRIBUTION, DATA PREPROCESSING, DATA AUGMENTATION, NORMALIZING DATA SETS, MACHINE LEARNING MODELS, SUPERVISED AND UNSUPERVISED LEARNING

Q.36. What do you mean by hypotheses ? Explain its types.


Ans. Hypotheses are educated guesses about possible differences, relationships or causes. Hypotheses are statements of expectation about some characteristics of a population. Etymologically, hypothesis is made up of two words, "hypo" (less than) and "thesis" (less certain than thesis). It is the presumptive statement of a proposition or a reasonable guess, based upon the available evidence, which the researcher seeks to prove through his study.
A hypothesis is a formal affirmative statement predicting a single research outcome, a tentative explanation of the relationship between two or more variables.
Simply stated, a hypothesis is an assumption or supposition to be proved or disproved. It is a guiding idea, a tentative explanation or a statement of probabilities which serves to initiate and guide observation, the search for relevant data, or considerations to predict results or consequences. Hypotheses are measurable and testable. They are of various types based on the manner in which they are tested.
Hypotheses are of two types —
(i) Directional hypothesis
(ii) Non-directional hypothesis

(i) Directional Hypothesis — This hypothesis states a relationship between the variables being studied or a difference between experimental treatments that the researcher expects to emerge. A directional hypothesis can be stated as a statistical hypothesis. However, a statistical hypothesis can be stated in the directional form only when there is complete certainty that the findings will show a relationship or difference in the expected direction. This is because the directional hypothesis can be tested using a one-tailed test of significance.

(ii) Non-directional Hypothesis — If a given hypothesis does not indicate the nature of the relationship between two variables (i.e. whether positive or negative), or it does not indicate the nature/direction of differences between two or more groups on a variable (i.e. which group will perform better), then it is known as the non-directional hypothesis.

In the present study, null hypotheses are formulated in order to study the information literacy skills of student teachers and the effect of intervention programs.

A null hypothesis is non-directional in nature, as it does not specify the direction of differences between, or relationships among, variables. The null hypothesis is related to a statistical method of interpreting conclusions about population characteristics that are inferred from the variable relationships observed in the sample.

Hypotheses are formed to study the existing conditions. Thus, the null hypothesis is individually tested statistically in order to decide whether it can be accepted or rejected.


Q.37. Write short note on tests of hypothesis.

Ans. A test of a statistical hypothesis is a procedure or a rule for deciding whether to accept or reject the hypothesis on the basis of the sample values obtained. A hypothesis which is tested under the assumption that it is true is called a null hypothesis and is denoted by $H_0$. Thus a hypothesis which is tested for possible rejection under the assumption that it is true is known as the null hypothesis. The hypothesis which differs from a given null hypothesis $H_0$ and is accepted when $H_0$ is rejected is called an alternative hypothesis and is denoted by $H_1$.


Q.38. Write down the rules for testing a hypothesis.
Ans. The procedure for testing a hypothesis is as follows —

(i) Mention the null hypothesis $H_0$ to be tested and the alternative hypothesis $H_1$.
(ii) Make some assumptions, such as the sample is random, the population is normal, the variances of two different populations are equal but unknown.
(iii) Then find the most appropriate test statistic together with its sampling distribution. A statistic whose primary role is that of providing a test of some hypothesis is called a test statistic.
(iv) On the basis of the sampling distribution, make a decision rule to either accept or reject the null hypothesis $H_0$. Let a die be thrown. The null hypothesis $H_0$ is that the die is unbiased, i.e. the proportion of aces is 1/6 in a large number of throws. Let $H_1$ be that the proportion of aces is 1/7.
Then the following rule may be suggested for testing our hypothesis —
Accept $H_0$ if not more than 17 aces occur in 100 throws.
Reject $H_0$, i.e. accept $H_1$, if more than 17 aces occur in 100 throws.
The number 17 has separated the sample space into two regions, the first for the acceptance and the second for the rejection of the null hypothesis $H_0$.
(v) Take a random sample and compute the test statistic. If the calculated value of the test statistic falls in the acceptance region, accept the null hypothesis $H_0$. If it falls in the region of rejection, reject the null hypothesis and accept $H_1$.

Q.39. Discuss the various techniques used for testing hypothesis.
Ans. There are two types of statistical techniques which are used for testing of hypotheses. They are parametric and non-parametric techniques.

(i) Parametric Techniques — Parametric techniques can be applied for the purpose of testing the hypotheses if the following conditions are satisfied.
(a) When the sample is randomly selected.
(b) When the variances of the various groups are equal or nearly equal.
(c) When the data are in the form of interval scale or ratio scale.
(d) When the observations are independent.
(e) When the sample size is more than 30.
(f) When the data follow a normal distribution.

(ii) Non-parametric Techniques — When the above conditions are not satisfied, the non-parametric techniques have to be used. The non-parametric tests are population-free tests, as they are not based on the characteristics of the population. They do not specify normally distributed populations or equal variances.

The techniques which enable us to compare samples and make inferences or tests of significance without having to assume normality in the populations are known as non-parametric techniques.

Some of the non-parametric techniques are the chi-square test, the rank difference correlation coefficient, the sign test, the median test and the sum-of-ranks test. The non-parametric techniques do not have the "power" of parametric tests, that is, they are less able to detect a true difference when such is present. Non-parametric tests should not be used, therefore, when other more exact tests are applicable.
When the data have been collected randomly and all the conditions required for parametric tests are satisfied, the following techniques are employed —
(a) t-test
(b) ANOVA
(c) ω² estimate.


Q.40. Explain in detail about multiple hypothesis testing. (R.G.P.V., May 2019)
Or
Write short note on multiple hypothesis testing. (R.G.P.V., …)

Ans. The multiple hypothesis testing problem is the situation when we wish to consider many hypotheses simultaneously. For example, suppose we have n genes and data about expression levels for each gene among healthy individuals and those with prostate cancer.

                              Healthy (k patients)               Prostate Cancer (l patients)
Expression Level of Gene i    $x^{(0)}_{ij},\ 1 \le j \le k$     $x^{(1)}_{ij},\ 1 \le j \le l$

The $i$-th null hypothesis, denoted $H_{0,i}$, would state that the mean expression level of the $i$-th gene is the same in both groups of patients. Equivalently,

$$H_{0,i} : E\left(x^{(0)}_{ij}\right) = E\left(x^{(1)}_{ij}\right)$$

We define a type of multiple testing called global testing, and then discuss two different global tests: Bonferroni's test and Fisher's combination test.

Global Testing — One task in multiple testing is called global testing, in which we test the global null

$$H_0 = \bigcap_{i=1}^{n} H_{0,i}$$

That is, the global null states that all of the individual nulls are true. In the prostate cancer example, the global null would be that $E\left(x^{(0)}_{ij}\right) = E\left(x^{(1)}_{ij}\right)$ for all $1 \le i \le n$.

Suppose that for each hypothesis $H_{0,i}$, we already have a test statistic and, hence, a p-value $p_i$. For simplicity, we assume that $p_i \sim U[0, 1]$. We would like to combine $p_1, \ldots, p_n$ to test $H_0$.

(a) Bonferroni's Method — A simple method for testing the global null, and one which we will see later is difficult to improve upon when testing against sparse alternatives, is Bonferroni's method.

Procedure —
Given a desired level $\alpha$, we can test the global null by simply testing each $H_{0,i}$ at level $\alpha/n$ and rejecting $H_0$ whenever any of the $H_{0,i}$ is rejected. This amounts to rejecting whenever

$$\min_i p_i \le \alpha/n$$

We refer to this as Bonferroni's global test.
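
Bonferroni's global test reduces to a single comparison, as the following sketch shows. The function name bonferroni_global_test and the list of p-values are hypothetical and serve only to illustrate the rejection rule stated above.

```python
def bonferroni_global_test(p_values, alpha=0.05):
    """Return True when the global null is rejected, i.e. min p_i <= alpha / n."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    return min(p_values) <= alpha / n


# Hypothetical p-values for n = 5 individual hypotheses
p_values = [0.30, 0.004, 0.75, 0.12, 0.51]
reject_global_null = bonferroni_global_test(p_values, alpha=0.05)
print(reject_global_null)  # True, because 0.004 <= 0.05 / 5 = 0.01
```

Only the smallest p-value matters to this test, which is why it is well suited to cases where a few individual hypotheses are strongly violated.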


(b) Fisher's Combination Test — Fisher's combination test is a global test that rejects for large values of the following statistic —

$$T = -\sum_{i=1}^{n} 2 \log p_i$$

Notice that the function $p \mapsto -\log p$ increases as $p \to 0$, so it makes sense that smaller p-values will push up the value of T. Assuming the hypotheses are independent, the (finite-sample) distribution of T is known.


i 
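The two global tests can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example p-values are hypothetical, and the Fisher p-value relies on the standard result that T is chi-squared with 2n degrees of freedom under the global null with independent p-values.

```python
import math


def bonferroni_global_test(p_values, alpha=0.05):
    """Reject the global null when min p_i <= alpha / n."""
    n = len(p_values)
    return min(p_values) <= alpha / n


def fisher_combination_test(p_values, alpha=0.05):
    """Return (T, combined p-value, reject?) for Fisher's combination test."""
    n = len(p_values)
    t_stat = -2.0 * sum(math.log(p) for p in p_values)
    # For an even number of degrees of freedom (2n), the chi-squared upper
    # tail has the closed form exp(-T/2) * sum_{k=0}^{n-1} (T/2)^k / k!.
    half = t_stat / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    combined_p = math.exp(-half) * total
    return t_stat, combined_p, combined_p <= alpha


# Hypothetical p-values: one very small value vs. many moderately small ones.
sparse = [0.0004, 0.62, 0.35, 0.81, 0.47]
spread = [0.04, 0.06, 0.03, 0.08, 0.05]

print(bonferroni_global_test(sparse), bonferroni_global_test(spread))
print(fisher_combination_test(sparse)[2], fisher_combination_test(spread)[2])
```

Bonferroni's test reacts to a single very small p-value, whereas Fisher's statistic accumulates evidence from many moderately small ones, which is why the second data set is rejected by Fisher's test but not by Bonferroni's.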


Q.41. Write short notes on the following -
(i) Level of significance (ii) Test of significance
(iii) Confidence limits (iv) Sampling of variable.

Ans. (i) Level of Significance - The probability level below which we reject the hypothesis is known as the level of significance. The region in which a sample value falling is rejected is known as the critical region. We generally take two critical regions which cover 5% and 1% areas of the normal curve. The shaded portion in the figure corresponds to 5% level of significance. Hence the probability of the value of the variate falling in the critical region is the level of significance.
Depending on the nature of the problem, we use a single-tail test or a double-tail test to estimate the significance of a result. In a double-tail test, the areas of both tails of the curve representing the sampling distribution are taken into account, whereas in a single-tail test only the area on the right of an ordinate is taken into consideration. For instance, to test whether a coin is biased or not, a double-tail test should be used, since a biased coin gives either more heads than tails (which corresponds to the right tail) or more tails than heads (which corresponds to the left tail).

(ii) Test of Significance - The procedure which enables us to decide whether to accept or reject the hypothesis is called the test of significance. Here we test whether the differences between the sample values and the population values (or the values given by two samples) are so large that they signify evidence against the hypothesis, or whether these differences are so small as to be accounted for by fluctuations of sampling.

(iii) Confidence Limits - Let the sampling distribution of a statistic S be normal with mean $\mu$ and standard deviation $\sigma$. As in fig. 1.23, the sample statistic S can be expected to lie in the interval $(\mu - 1.96\sigma,\ \mu + 1.96\sigma)$ for 95% of the time, i.e., we can be confident of finding $\mu$ in the interval $(S - 1.96\sigma,\ S + 1.96\sigma)$ in 95% of cases. Because of this, we call $(S - 1.96\sigma,\ S + 1.96\sigma)$ the 95% confidence interval for estimation of $\mu$. The ends of this interval (i.e. $S \pm 1.96\sigma$) are called the 95% confidence limits (or fiducial limits) for S. Similarly, $S \pm 2.58\sigma$ are the 99% confidence limits. The numbers 1.96, 2.58 etc. are called confidence coefficients. The values of the confidence coefficients corresponding to various levels of significance can be obtained from the normal curve area table.
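As a small illustration of how the confidence coefficients relate to the normal curve, the sketch below derives them from the standard normal distribution using Python's `statistics.NormalDist`. The function name `confidence_limits` and the sample values (S = 50, σ = 2) are assumptions made purely for the example.

```python
from statistics import NormalDist


def confidence_limits(s, sigma, confidence=0.95):
    """Return (lower, upper) limits S -/+ z*sigma for a normally distributed statistic."""
    # Two-sided coefficient: the z value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return s - z * sigma, s + z * sigma


z95 = NormalDist().inv_cdf(0.975)  # close to the tabulated 1.96
z99 = NormalDist().inv_cdf(0.995)  # close to the tabulated 2.58
low, high = confidence_limits(s=50.0, sigma=2.0)
print(round(z95, 2), round(z99, 2), round(low, 2), round(high, 2))
```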


Fig. 1.23 (normal curve with the critical regions in both tails, bounded at $\mu - 1.96\sigma$ and $\mu + 1.96\sigma$)

(iv) Sampling of Variable (Small Samples) - In practical problems, we cannot always have large samples on account of economical factors. We have to depend on small samples (< 30). Since the sample size is small, we cannot assume, as in the case of large samples, that the random sampling distribution is approximately normal.


Q.42. Write short note on composite hypothesis. (R.G.P.V.)

Ans. As the order of the integration method is increased, the order of the derivative in the error term associated with the method also increases. For a method to produce meaningful results, these higher order derivatives must remain continuous in the interval of interest. Also, Newton-Cotes type methods of higher order sometimes produce diverging results. An alternative to obtain accurate results, while using lower order methods, is the use of composite integration methods. We subdivide the given interval [a, b] or [-1, 1] into a number of subintervals and evaluate the integral in each subinterval by a particular method. This is known as a composite or multisegment method.

In hypothesis testing, a hypothesis is called composite when it does not specify the population distribution completely (for example, $H : \mu > \mu_0$), in contrast to a simple hypothesis, which specifies it completely.


Q.43. Write short note on critical region. (R.G.P.V., June)

Ans. A region (corresponding to a statistic t) is called the sample space. The part of the sample space which amounts to rejection of the null hypothesis $H_0$ is called the critical region or region of rejection.

If $x = (x_1, x_2, \ldots, x_n)$ is the random vector observed and $W_c$ is the critical region (which corresponds to the rejection of the hypothesis according to a prescribed test procedure) of the sample space W, then

$$W_a = W - W_c$$

of the sample space is called the acceptance region.


Q.44. Write short note on most powerful critical region. (R.G.P.V., Dec.)

Ans. In testing the hypothesis $H_0 : \theta = \theta_0$ against the alternative $H_1 : \theta = \theta_1$, the critical region is best if the type II error is minimum, or the power is maximum, when compared to every other possible critical region of size $\alpha$.

A test defined by this critical region is called the most powerful test.

Q.45. What do you mean by probability distribution and discrete probability distribution ?

Ans. Probability Distribution - Having explored the idea of probability, we can consider the concept of a probability distribution. In situations where the variable being studied is a random variable, it can often be modelled by a probability distribution. Simply put, a probability distribution has two components - the collection of possible values that the variable can take, together with the probability that each of these values (or a subset of these values) may take. In stochastic modelling, a probability distribution is the equivalent of a function in deterministic modelling (in which there is no uncertainty). As we know, there are many different functions available for deterministic modelling, and some of these are used more often than others, e.g., linear functions, polynomial functions and exponential functions. Exactly the same situation prevails in stochastic modelling. Certain probability distributions are used more often than others. These include the Binomial distribution, the Poisson distribution, the uniform distribution, the normal distribution and the negative exponential distribution.


Discrete Probability Distributions - A discrete random variable assumes each of its values with a certain probability, i.e. each possible value of the random variable has an associated probability. Let X be a discrete random variable and let each value of the random variable have an associated probability, denoted $p(x) = P(X = x)$, such that

| X    | $x_1$ | $x_2$ | ... | $x_n$ |
| p(x) | $p_1$ | $p_2$ | ... | $p_n$ |

The function p(x) is known as the probability distribution of the random variable X if the following conditions are satisfied -
(i) $p(x) \ge 0$ for all values x of X
(ii) $\sum_x p(x) = 1$.

p(x) is also referred to as the probability function or probability mass function.


Q.46. What do you understand by the term Binomial distribution ?

Ans. The Binomial distribution is a very simple discrete probability distribution because it models a situation in which a single trial of some process or experiment can result in only one of two mutually exclusive outcomes (such a trial is called a Bernoulli trial after the mathematician Bernoulli). We have already met examples of this distribution in the earlier discussion on probability.

To obtain the probability of the happening of an event once, twice, thrice, ..., r times in n trials, suppose the probability of the happening of an event A is p and of its not happening is q = 1 - p. Suppose there are n trials and the event A happens r times. This may be shown as follows.

$$\underbrace{A\,A \cdots A}_{r \text{ times}}\ \underbrace{\bar{A}\,\bar{A} \cdots \bar{A}}_{n-r \text{ times}} \qquad \ldots(i)$$

where A indicates its happening and $\bar{A}$ its failure; then $P(A) = p$ and $P(\bar{A}) = q$.

We see that relation (i) has the probability

$$\underbrace{p\,p \cdots p}_{r \text{ times}}\ \underbrace{q\,q \cdots q}_{n-r \text{ times}} = p^r q^{n-r}$$

Clearly relation (i) is merely one order of arranging r A's and (n - r) $\bar{A}$'s.

The probability of the event happening r times = $p^r q^{n-r}$ × number of different arrangements of r A's and (n - r) $\bar{A}$'s.

The number of different arrangements of r A's and (n - r) $\bar{A}$'s is ${}^{n}C_{r}$.

Probability of happening of an event r times = ${}^{n}C_{r}\, p^r q^{n-r}$

$$P(r) = {}^{n}C_{r}\, p^r q^{n-r}, \quad (r = 0, 1, 2, \ldots, n)$$

= (r + 1)th term of $(q + p)^n$

If r = 0, the probability of happening of an event 0 times = ${}^{n}C_{0}\, q^{n} p^{0} = q^n$
If r = 1, the probability of happening of an event 1 time = ${}^{n}C_{1}\, q^{n-1} p$
If r = 2, the probability of happening of an event 2 times = ${}^{n}C_{2}\, q^{n-2} p^{2}$ and so on.

These terms are clearly the successive terms in the expansion of $(q + p)^n$. Hence it is called Binomial distribution.


Q.47. What is Binomial frequency distribution ? Give the applications of Binomial distribution.

Ans. Binomial Frequency Distribution - If n independent trials constitute one experiment and this experiment is repeated N times, then the frequency of r successes is $N \cdot {}^{n}C_{r}\, p^r q^{n-r}$. The possible numbers of successes together with these expected frequencies constitute the Binomial frequency distribution.

Applications of Binomial Distribution - Binomial distribution is applied to problems concerning -

(i) Number of defectives in a sample from production line.
(ii) Estimation of reliability of systems.
(iii) Number of rounds fired from a gun hitting a target.
(iv) Radar detection.
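The binomial probabilities and the expected frequencies of the Binomial frequency distribution can be computed directly. The following is a minimal sketch; the function names and the example (n = 4 trials, p = 0.5, N = 160 repetitions) are illustrative assumptions.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(r) = nCr * p^r * q^(n - r) with q = 1 - p."""
    q = 1.0 - p
    return comb(n, r) * p**r * q**(n - r)


def expected_frequencies(n, p, experiments):
    """Expected frequency N * P(r) of r successes, for r = 0, 1, ..., n."""
    return [experiments * binomial_pmf(r, n, p) for r in range(n + 1)]


probs = [binomial_pmf(r, 4, 0.5) for r in range(5)]
freqs = expected_frequencies(4, 0.5, 160)
print(probs)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(freqs)  # [10.0, 40.0, 60.0, 40.0, 10.0]
```

The probabilities sum to one because they are the successive terms of the expansion of (q + p)^n with q + p = 1.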

Q.48. Explain the term Poisson distribution, also give the applications of it.

Ans. Poisson distribution is a distribution related to the probabilities of events which are extremely rare, but which have a large number of independent opportunities for occurrence.

Poisson distribution is a particular limiting form of the binomial distribution, obtained by making n very large and p very small, keeping np fixed (= m, say).

The probability of r successes in a Binomial distribution is

$$P(r) = {}^{n}C_{r}\, p^r q^{n-r} = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!}\, p^r q^{n-r}$$

$$= \frac{np\,(np - p)(np - 2p)\cdots(np - (r-1)p)}{r!}\,(1 - p)^{n-r}$$

As $n \to \infty$, $p \to 0$ ($np = m$), we have

$$P(r) = \frac{m^r}{r!} \lim_{n \to \infty} \left(1 - \frac{m}{n}\right)^{n-r} = \frac{m^r e^{-m}}{r!}$$

So that the probabilities of 0, 1, 2, ..., r, ... successes in a Poisson distribution are given by

$$e^{-m},\quad m e^{-m},\quad \frac{m^2 e^{-m}}{2!},\quad \ldots,\quad \frac{m^r e^{-m}}{r!},\quad \ldots$$

The sum of these probabilities is unity as it should be.

Applications of Poisson Distribution - Poisson distribution is applied to problems concerning -
(i) Arrival pattern of 'defective vehicles in a workshop', 'patients in a hospital' or 'telephone calls'.
(ii) Demand pattern for certain spare parts.
(iii) Number of fragments from a shell hitting a target.
(iv) Spatial distribution of bomb hits.
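The limiting relationship between the binomial and Poisson distributions can be checked numerically. The sketch below is only an illustration under assumed values (m = 2 and a few choices of n); `max_gap` is a hypothetical helper that reports the largest difference between the two sets of probabilities for r = 0, ..., 7.

```python
from math import comb, exp, factorial


def binomial_pmf(r, n, p):
    """Binomial probability nCr * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)


def poisson_pmf(r, m):
    """Poisson probability m^r * e^(-m) / r!."""
    return m**r * exp(-m) / factorial(r)


def max_gap(n, m, r_max=8):
    """Largest |binomial - Poisson| over r = 0 .. r_max - 1, with p = m / n."""
    p = m / n
    return max(abs(binomial_pmf(r, n, p) - poisson_pmf(r, m)) for r in range(r_max))


for n in (10, 100, 10_000):
    print(f"n = {n:>6}: max gap = {max_gap(n, 2.0):.6f}")
```

The gap shrinks as n grows while np = m is held fixed, which is the limiting behaviour used in the derivation.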


Q.49. Define continuous distribution.

Ans. So far we have dealt with discrete distributions where the variate takes only integral values. But variates like temperatures, heights and weights can take all values in a given interval. Such variables are said to be continuous variables.

Suppose f(x) is a continuous probability density function, then

$$\text{Mean} = \bar{x} = \int_{-\infty}^{\infty} x\, f(x)\, dx$$

$$\text{Variance} = \int_{-\infty}^{\infty} (x - \bar{x})^2 f(x)\, dx$$

Q.50. What do you understand by the term normal distribution ? (R.G.P.V., May)

Ans. Normal distribution is a continuous distribution. It is derived as the limiting form of the Binomial distribution for large values of n, with p and q not very small.

The normal distribution is given by the equation

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad \ldots(i)$$

where $\mu$ = mean, $\sigma$ = standard deviation, e = 2.71828..., $\pi$ = 3.14159...

$$P(x_1 < x < x_2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$$

On putting $z = \frac{x - \mu}{\sigma}$ in relation (i), we get

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad \ldots(ii)$$

Here, mean = 0, standard deviation = 1.
Relation (ii) is known as the standard form of normal distribution.

Moment generating function of the continuous probability distribution about x = a is given by

$$M_a(t) = \int_{-\infty}^{\infty} e^{t(x-a)} f(x)\, dx$$

where f(x) is the probability density function.

Applications - Normal distribution is applied to problems concerning -
(i) Calculation of errors made by chance in experimental measurements.
(ii) Computation of hit probability of a shot.
(iii) Statistical inference in almost every branch of science.
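A short Python sketch may help connect the density formula, the interval probability and the standard form. The function names are hypothetical and the interval probability is obtained from `statistics.NormalDist` rather than by integrating the density by hand.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (sigma * sqrt(2.0 * pi))


def interval_probability(x1, x2, mu, sigma):
    """P(x1 < X < x2) as a difference of cumulative probabilities."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(x2) - dist.cdf(x1)


def standardize(x, mu, sigma):
    """z = (x - mu) / sigma maps X to the standard normal variate."""
    return (x - mu) / sigma


# With mu = 10 and sigma = 2, x = 12 corresponds to z = 1.
z = standardize(12.0, 10.0, 2.0)
print(z, normal_pdf(z, 0.0, 1.0), interval_probability(6.0, 14.0, 10.0, 2.0))
```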


Q.51. Write short note on exponential distribution.

Ans. The distribution of a random variable whose natural logarithm follows a normal distribution is described here under the name exponential distribution; in standard usage this distribution is called the log-normal distribution.

The density function is given by

$$f(x) = \frac{1}{\sigma x \sqrt{2\pi}}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}$$

where the range of the random variable is x > 0. The parameters $\mu$ and $\sigma^2$ are

$$\mu = E(\log X)$$

$$\sigma^2 = V(\log X)$$

Mean and Variance - The mean and variance of the distribution are

$$E(X) = e^{\mu + \sigma^2/2}$$

$$V(X) = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right)$$

respectively.

Applications of Exponential Distribution - The distribution has been used to describe lifetimes of electrical and mechanical systems, abundance of species of animals, incubation period of infectious diseases, concentration of elements in geological materials, and many other random phenomena occurring in both the social and natural sciences.
random p! 


Q.52. Explain the term frequency distribution.

Ans. Frequency distribution is an arrangement of data according to the number (called frequency) possessing the individual or grouped values of the variable.

In other words, a tabular form of the data in which the frequencies of the values of a variable are given along with them is called a frequency distribution.

(i) Univariate Frequency Distribution - A frequency distribution which shows the frequency of occurrence of different values of a single variable is called a univariate frequency distribution.

(ii) Bivariate Frequency Distribution - A frequency distribution based on two variables is known as a bivariate frequency distribution.

(iii) Discrete Frequency Distribution - A frequency distribution which is formed by distinct values of a discrete variable or a continuous variable is called a discrete frequency distribution.

(iv) Grouped Frequency Distribution - A frequency distribution which is obtained by dividing the entire range of given observations on a discrete or continuous variable into groups and distributing the frequencies over these groups is called a grouped frequency distribution. The groups are called the classes and the boundary ends are called class limits. For the class 15-20, say, 15 is the lower limit and 20 is the upper limit. The difference between the upper and lower limits of a class or class interval is called its magnitude or width or class interval. The number of observations falling within a particular class is called its frequency or class frequency. The value of the variable which lies mid-way between the upper and lower limits is called the mid-value or mid-point of that class.
Q.53. Explain the term rectangular distribution.

Ans. Rectangular distribution is a probability distribution in which the probability function f(x) is constant over the entire range a < x < b of the variable and zero elsewhere.

Fig. 1.24 Graph of Rectangular Distribution

This distribution is so called since the curve y = f(x) describes a rectangle over the x-axis and between the ordinates at x = a and x = b. This implies that X is a continuous variable.

Since X is a rectangular variate in the range [a, b], we have

$$\int_a^b f(x)\, dx = 1, \quad \text{i.e.,} \quad f(x) \int_a^b dx = 1,$$

as f(x) is constant.

$$f(x)\,(b - a) = 1, \quad \text{i.e.,} \quad f(x) = \frac{1}{b - a}$$

Thus a rectangular distribution is given by the probability function

$$f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$$

Distribution Function of X - The distribution function of X is

$$F(x) = \begin{cases} 0, & x < a \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & x > b \end{cases}$$

Fig. 1.25 Graph of Distribution Function

Cor.1. The probability that an observation falls in any interval within $a \le x \le b$ is $\frac{1}{b - a}$ times the length of the interval.

Suppose (c, d) is the new interval so that

$$P(c \le x \le d) = \int_c^d \frac{dx}{b - a} = \frac{d - c}{b - a}$$
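The probability function, the distribution function and the interval rule of Cor.1 for the rectangular distribution translate directly into code. This is a minimal sketch with hypothetical function names and an assumed range a = 2, b = 10.

```python
def uniform_pdf(x, a, b):
    """f(x) = 1 / (b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x, a, b):
    """F(x) = 0 below a, (x - a) / (b - a) on [a, b], 1 above b."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def interval_probability(c, d, a, b):
    """P(c <= X <= d), which equals (d - c) / (b - a) when a <= c <= d <= b."""
    return uniform_cdf(d, a, b) - uniform_cdf(c, a, b)


print(uniform_pdf(5.0, 2.0, 10.0))                # 0.125
print(uniform_cdf(4.0, 2.0, 10.0))                # 0.25
print(interval_probability(4.0, 6.0, 2.0, 10.0))  # 0.25
```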
Q.54. Explain in detail about the term data preprocessing.

Ans. Data preprocessing is an important step in any machine learning problem that aims to transform the raw input features into a feature space that is easily interpretable by a machine. The most common preprocessing steps are standardization and whitening. Here X denotes the dataset, $\mu$ the population mean and $\sigma$ the population standard deviation.

Standardization - Standardization is the most popular form of preprocessing and is commonly comprised of mean subtraction and subsequent scaling by the standard deviation. The reason for mean subtraction is mainly that non-zero mean input data creates a loss surface that is steep in some directions and shallow in others, such that it slows down convergence of gradient-based optimization techniques. Conversely, input data that has a large variation in spread along different directions negatively affects the convergence rate.


Mean subtraction can be formalized as -

$$\mu = \frac{1}{n} \sum_{i=1}^{n} X_i$$

$$X^{(0)} = X - \mu$$

where $\mu$ denotes the mean and $X^{(0)}$ the zero centred data.

Mean subtraction has the geometric interpretation of centring the cloud of data around the origin along every dimension as shown in fig. 1.26 (b).

Fig. 1.26 Visualization of the Standardization Transform: (a) Original Data (b) Zero Centred Data (c) Standardized Data

Standardization refers to altering the data dimensions such that they are of approximately the same scale. This is commonly achieved by dividing each dimension by its standard deviation once it has been zero centred as in

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2}$$

$$X^{(s)} = \frac{X - \mu}{\sigma}$$

where $\sigma$ denotes the standard deviation and $X^{(s)}$ the standardized data.

Dividing by the standard deviation has the geometric interpretation of altering the spread of the data such that the data dimensions are comparable to each other.
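The standardization transform can be sketched with plain Python lists, treating each column as one data dimension. The function name and the small data set are hypothetical; the sketch assumes every column has a non-zero standard deviation.

```python
from math import sqrt


def standardize_columns(rows):
    """Zero-centre each column, then divide it by its population standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    # Mean subtraction: mu = (1/n) * sum of X_i, then X - mu.
    means = [sum(row[d] for row in rows) / n for d in range(dims)]
    centred = [[row[d] - means[d] for d in range(dims)] for row in rows]
    # Scaling: sigma = sqrt((1/n) * sum of (X_i - mu)^2); assumes sigma > 0.
    stds = [sqrt(sum(row[d] ** 2 for row in centred) / n) for d in range(dims)]
    return [[row[d] / stds[d] for d in range(dims)] for row in centred]


# Two dimensions on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
standardized = standardize_columns(data)
for row in standardized:
    print([round(value, 4) for value in row])
```

After the transform both columns have zero mean and unit variance, so neither dimension dominates purely because of its units.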


Whitening - It is sometimes not enough to centre and scale the features independently using the standardization process, since a downstream model can further make assumptions on the linear independence of the features. To address this issue, we can make use of the whitening transformation to further remove the linear correlation across features. There are many possible ways to obtain a whitening transformation, such as zero component analysis (ZCA) or principal component analysis (PCA). PCA is also widely used in machine learning for the purpose of dimensionality reduction. The whitening transformation is a two step process that involves decorrelation using the computed eigenvectors and subsequent scaling of the decorrelated data with the eigenvalues.


The first step of PCA whitening is to perform the singular value decomposition (SVD) of the covariance matrix as in -

$$U, S, U^T = \mathrm{SVD}(\Sigma)$$

where $\Sigma$ denotes the covariance matrix, U its eigenvectors, and S contains the corresponding eigenvalues along its diagonal.

In order to decorrelate the data, we compute the dot product of the zero centred data with the eigenvectors as in

$$X^{(d)} = X^{(0)} U$$

where U is the matrix of eigenvectors and $X^{(d)}$ denotes the decorrelated data. Correlated input data usually causes the eigenvectors to be rotated away from the coordinate axes and thus weight updates are not decoupled. However, adaptive optimization methods mitigate the need for decorrelation since the weights are updated at different rates independently. The geometric interpretation of decorrelation is to align the data in the directions of the most variance as shown in fig. 1.27 (b).


Fig. 1.27 Visualization of the Whitening Transform: (a) Original Data (b) Decorrelated Data (c) Whitened Data

The final operation takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale as in

$$X^{(w)} = \frac{X^{(d)}}{\sqrt{S + \epsilon}}$$

where S is the matrix of eigenvalues and $X^{(w)}$ denotes the whitened data. The $\epsilon$ is a small number that is added to avoid division by zero and therefore increases numerical stability.

The geometric interpretation of this transformation is that if the input data is a multivariate gaussian, then the whitened data will be gaussian with zero mean and identity covariance matrix as shown in fig. 1.27 (c).
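The two steps of PCA whitening can be illustrated for two-dimensional data without any external library, because the eigenvalues and eigenvectors of a symmetric 2 x 2 covariance matrix have a closed form. This is only a sketch under that assumption: all function names and the data set are hypothetical, and a practical pipeline would typically call a linear algebra routine for the decomposition instead.

```python
from math import sqrt


def covariance_2d(centred):
    """Entries (a, b, c) of the covariance matrix [[a, b], [b, c]] of zero-centred points."""
    n = len(centred)
    a = sum(x * x for x, _ in centred) / n
    b = sum(x * y for x, y in centred) / n
    c = sum(y * y for _, y in centred) / n
    return a, b, c


def eigen_2x2_symmetric(a, b, c):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, c]] (closed form)."""
    mid = (a + c) / 2.0
    radius = sqrt(((a - c) / 2.0) ** 2 + b * b)
    values = (mid + radius, mid - radius)
    if b == 0.0:
        # Already decorrelated: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for lam in values:
            vx, vy = lam - c, b  # solves b * vx + (c - lam) * vy = 0
            norm = sqrt(vx * vx + vy * vy)
            vectors.append((vx / norm, vy / norm))
    return values, vectors


def pca_whiten(rows, eps=1e-5):
    """Zero-centre, decorrelate and rescale 2-D points."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centred = [(x - mean_x, y - mean_y) for x, y in rows]
    values, vectors = eigen_2x2_symmetric(*covariance_2d(centred))
    whitened = []
    for x, y in centred:
        # Decorrelation: project the centred point onto each eigenvector.
        projected = [x * vx + y * vy for vx, vy in vectors]
        # Scaling: divide each dimension by sqrt(eigenvalue + eps).
        whitened.append(tuple(p / sqrt(lam + eps) for p, lam in zip(projected, values)))
    return whitened


data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-1.5, -1.4), (1.0, 1.3), (-1.0, -1.3)]
white = pca_whiten(data)
print([round(v, 3) for v in covariance_2d(white)])  # close to variance 1, covariance 0
```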
d iden 


Q.55. What do you mean by data augmentation ? Explain.

Ans. Data augmentation is a common method in deep learning used to reduce the effect of overfitting. The idea is to expand an existing data set using only the available data so that the learning algorithm can more effectively extract the features essential to the task. To train deep learning models, large data sets are required, usually from manual data collection or from already existing databases. However, in some cases only a limited data set is available. Therefore, to expand the size of the data set, data augmentation can be employed. For example, in Wi-Fi based indoor positioning, the complex indoor environment causes problems because of the limited coverage of Wi-Fi APs and the resulting measurements. The purpose of data augmentation in this case is to remove faulty or invalid measurement data and to improve the accuracy and efficiency of the entire positioning system by creating a data representation that is more suitable for downstream deep learning.

Data augmentation adds value to base data by adding information derived from internal and external sources within the database. It can also reduce the manual intervention required to develop meaningful information and insight from business data, as well as significantly enhance data quality. Moreover, we can produce multiple copies of available data with slight changes. The techniques used in data augmentation include -
(i) Extrapolation, in which the relevant fields are updated or assigned values based on heuristics.
(ii) Tagging, in which common records are tagged to a group, so that the group is easier to understand and differentiate.
(iii) Aggregation, in which values are estimated for relevant fields if needed using mathematical values of averages and means.
(iv) Probability, in which values are populated based on the probability of events, using heuristics and analytical statistics.
Q.56. Describe various forms of data normalization. Also specify their value range.

Ans. An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbour classification and clustering. If using the neural network backpropagation algorithm for classification mining, normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges from outweighing attributes with initially smaller ranges. There are many methods for data normalization. Some of them are as follows -
(i) Min-max Normalization - It performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v, of A to v' in the range $[new\_min_A,\ new\_max_A]$ by computing

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.


(ii) Z-score Normalization - It is also called zero-mean normalization. In this normalization the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown or when there are outliers that dominate the min-max normalization.


(iii) Decimal Scaling Normalization - This method normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. The value, v, of A is normalized to v' by computing

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that $\max(|v'|) < 1$.
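The three normalization methods can be sketched as small Python functions. The function names and the example figures (an income attribute ranging from 12,000 to 98,000 with mean 54,000 and standard deviation 16,000, and values between -986 and 917) are hypothetical and chosen only to exercise the formulas.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest non-negative integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:
        j += 1
    return [v / 10**j for v in values], j


print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
print(decimal_scaling([-986, 917]))            # ([-0.986, 0.917], 3)
```

Min-max output stays inside the chosen target range, z-scores are unbounded, and decimal scaling always yields values strictly between -1 and 1.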


Q.57. Write and explain different types of machine learning.

Ans. The different types of machine learning are shown in fig. 1.28.

Fig. 1.28 Types of Machine Learning (supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning)

(i) Supervised Learning - In this type of learning, the machine is provided with a given set of inputs with their desired outputs. The machine needs to study those given sets of inputs and outputs and find a general function that maps inputs to desired outputs.

(ii) Unsupervised Learning - This type of learning is termed as 'learning on its own' by discovering and adopting, based on the input pattern. In this learning the data are divided into different clusters and hence the learning is called a clustering algorithm.

(iii) Semi-supervised Learning - This learning is used for the same applications as supervised learning. But it uses both labeled and unlabeled data for training. This type of learning can be used with methods such as classification, regression and prediction. Semi-supervised learning is useful when the cost associated with labeling is too high to allow for a fully labeled training process. Early examples of this include identifying a person's face on a web cam.

(iv) Reinforcement Learning (RL) - In this type of learning, the machine is trained to take specific decisions based on the business requirement with the aim to maximize the efficiency (performance). This continual learning reduces the participation of human expertise and saves more time. Reinforcement learning is often used for robotics, gaming and navigation. With reinforcement learning, the algorithm discovers through trial and error which actions yield the greatest rewards.


Q.58. What do you mean by supervised learning ?

Ans. Supervised learning means learning from examples, where a training set is given which acts as examples for the classes. The system finds a description of each class. Once the description has been formulated, it is used to predict the class of previously unseen objects. This is similar to discriminant analysis which occurs in statistics.

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Common examples of supervised learning include -

(i) Classifying e-mails as spam
(ii) Labeling webpages based on their content
(iii) Voice recognition.

There are many examples of supervised learning algorithms, e.g., SVM (support vector machines), Naive Bayes classifiers, neural networks and decision trees.

Fig. 1.29 Supervised Learning (training text, documents, images, sounds are converted to features vectors for the machine learning algorithm; a new text, document, image or sound is converted to a features vector and passed to the predictive model)


Q.59. Describe the problem and issues in supervised learning.

Ans. There are six issues taken into account while dealing with supervised learning as follows -

(i) Heterogeneity of Data - Many algorithms like neural networks and support vector machines need their feature vectors to be homogeneous, numeric and normalized. The algorithms that employ distance metrics are very sensitive to this, and hence if the data is heterogeneous, these methods should not be the first choice. Decision trees can handle heterogeneous data very easily.

(ii) Redundancy of Data - If the data contains redundant information, i.e. highly correlated values, then it is useless to use distance based methods because of numerical instability. In this case, some sort of regularization can be applied to the data to prevent this situation.

(iii) Dependent Features - If there is some dependence between the feature vectors, then algorithms that monitor complex interactions, like neural networks and decision trees, fare better than other algorithms.

(iv) Bias-Variance Tradeoff - The training set may contain several different but equally good data sets. Now the learning algorithm is said to be biased for a particular input if, when trained on these data sets, it is systematically incorrect while predicting the correct output for that input. A learning algorithm has a high variance for a particular input when it predicts different outputs when trained on different data sets. Thus, there is a tradeoff between bias and variance, and a supervised learning approach is able to adjust this tradeoff.

(v) Amount of Training Data and Function Complexity - The amount of data required during the training period depends on the complexity of the function to be learned from the training data set. So, for a simple function with low complexity, the learning algorithm can learn from a small amount of data. On the other hand, for high complexity functions, the learning algorithm needs a large amount of data.

(vi) Dimensionality of the Input Space - If the input feature vectors have very high dimension then the learning problem can be difficult even if it depends on a small number of features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.
Q.60. What do you mean by unsupervised learning ?

Ans. Unsupervised learning is learning from observation and discovery. In this mode of learning, there is no training set or prior knowledge of the classes. The system analyzes the given set of data to observe similarities emerging out of the subsets of the data. The outcome is a set of class descriptions, one for each class, discovered in the environment. This is similar to cluster analysis in statistics.


Fig. 1.30 Unsupervised Learning (features vectors feed the machine learning algorithm; the model maps a new features vector to a likelihood, a cluster Id or a better representation)
Unsupervised learning makes sense of unlabeled data without having any predefined data set for its training. Unsupervised learning is an extremely powerful tool for analyzing available data and looking for patterns and trends. It is most commonly used for clustering similar input into logical groups. Common approaches to unsupervised learning include -

(i) k-means
(ii) Self-organizing maps
(iii) Hierarchical clustering.

Some popular examples of unsupervised learning algorithms are -
(i) Genetic algorithms
(ii) Clustering approaches
(iii) Apriori algorithm for association rule learning problems.
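As an illustration of clustering similar input into groups without labels, the sketch below implements the basic k-means iteration (assignment step followed by update step) for two-dimensional points. The function name, the starting centroids and the data points are assumptions made for the example; real applications usually choose starting centroids randomly and stop on convergence.

```python
def k_means(points, centroids, iterations=10):
    """Basic k-means on 2-D points with caller-supplied starting centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignments, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centres = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No class labels are supplied; the two groups emerge only from the distances between the points.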


Q.61. Write short note on semi-supervised learning.

Ans. Problems where we have a large amount of input data and only some of the data is labeled are called semi-supervised learning problems. These problems sit in between both supervised and unsupervised learning.

A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled. Many real world machine learning problems fall into this area. This is because it can be expensive or time-consuming to label data as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.

We can use unsupervised learning techniques to discover and learn the structure in the input variables. We can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
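The idea of feeding best guess predictions back as training data can be sketched with a one-dimensional nearest-neighbour rule. This is a deliberately simple illustration, not a recommended method: the function names, the labels and the data values are all hypothetical.

```python
def nearest_label(x, labeled):
    """Predict the label of x from its nearest labeled point (1-nearest-neighbour)."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]


def self_train(labeled, unlabeled):
    """Pseudo-label unlabeled points one at a time and add them to the training set."""
    training = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Take the unlabeled point closest to the current training set first,
        # since its best guess label is the most trustworthy.
        x = min(remaining, key=lambda u: min(abs(u - t[0]) for t in training))
        training.append((x, nearest_label(x, training)))
        remaining.remove(x)
    return training


labeled = [(0.0, "cat"), (10.0, "dog")]
unlabeled = [1.0, 2.0, 3.0, 4.0, 9.0]
augmented = self_train(labeled, unlabeled)

print(nearest_label(5.5, labeled))    # dog (only the two labeled points are used)
print(nearest_label(5.5, augmented))  # cat (the unlabeled points reveal the structure)
```

The unlabeled points show that the "cat" group extends up to 4.0, which changes the prediction for the new point 5.5.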


Q.62. Differences between supervised and unsupervised learning. [R.G.P.V., May 2019 (VI-Sem.)]

Ans. Differences between supervised and unsupervised learning are as follows -

| S.No. | Supervised Learning                                                           | Unsupervised Learning                                                 |
| (i)   | Knowledge of output - learning with presence of an expert.                    | No knowledge of output class or value.                                |
| (ii)  | Data is labelled with a class or value.                                       | Data is unlabelled or value unknown.                                  |
| (iii) | Its goal is to predict class or value label.                                  | Its goal is to determine data patterns.                               |
| (iv)  | Examples - Neural network, SVM, decision tree, Bayesian classifiers, etc.     | Examples - k-means, genetic algorithms, clustering approaches, etc.   |


LINEARITY VS NON-LINEARITY, ACTIVATION FUNCTIONS
LIKE SIGMOID, RELU, ETC., WEIGHTS AND BIAS, LOSS
FUNCTION, GRADIENT DESCENT, MULTILAYER NETWORK,
BACK PROPAGATION


Q.1. What is the relation between artificial intelligence, machine
learning and deep learning ?

Ans. Deep learning algorithms are multi-level representation learning
techniques that allow simple non-linear modules to transform
representations from the raw input into the higher levels of abstract
representations, with many of these transformations producing learned
complex functions.

Deep learning relies on representation learning, which is the technique
that allows machines to discover relationships from raw data, needed to
perform certain tasks like classification and detection. Deep learning,
a subfield of machine learning, is more recently being referred to as
representation learning in some literature. The direct relationships
between deep learning and its associated fields can be shown using the
relationship Venn diagram in fig. 2.1.

Fig. 2.1 Venn Diagram of the Components of Artificial Intelligence

Over the past six decades, the machine learning field, a branch of
artificial intelligence, started rapid expansion and research in the
field gained momentum to diversify into different aspects of human
existence. Machine learning is a field of study that uses statistics
and computer science principles to create statistical models, used to
perform major tasks like predictions and inference. These models are
sets of mathematical relationships between the inputs and outputs of a
given system. The learning process is the process of estimating the
model parameters such that the model can perform the specified task.
For machines, the learning process tries to give machines the ability
to learn without being programmed explicitly.

The typical artificial neural networks (ANNs) are biologically inspired
computer programmes, designed by the inspiration of the workings of the
human brain. These ANNs are called networks because they are composed
of different functions, which gather knowledge by detecting the
relationships and patterns in data using past experiences known as
training examples in most literature.

The early deep learning algorithms used for recognition tasks had few
layers in their entire architecture, with LeNet5 having just five
layers. These network layers have witnessed depth increases since then,
with AlexNet having twelve layers, VGGNet having sixteen and nineteen
layers in its two variants, twenty-two layers in GoogleNet, one hundred
and fifty-two layers in the largest ResNet architecture and over one
thousand two hundred layers in Stochastic Depth networks, already
trained successfully, with the layers still increasing to date. With
the networks getting deeper, the need to understand the makeup of the
hidden layers and the successive actions taking place within the layers
becomes inevitable.


Q.2. Describe the term linearity vs non linearity.

Ans. Most real world data is non-linear in nature. Linear data can be
handled perfectly as it exhibits usual data patterns with a one to one
relationship between the input and the output. Non-linear data patterns
are not usual and change continuously. To model these non-linear data
patterns, non-linear activation functions are used in NN models.

Linear functions have a constant derivative and exhibit a constant rate
of change, hence they have a constant descent at every iteration. Hence,
using a linear activation function, the NN can learn only fixed patterns
which are common in all iterations. Unusual data characteristics cannot
be addressed by the linear activation functions.

With the advent of large data bodies the need for studying the
non-linear characteristics of data has grown, and methods which can
learn them have come into focus. Non-linear activation functions came
into usage. The non-linear activation functions are curved in nature
with many gradients. At these gradients a local optimum exists. The main
design principle of the NN is to minimize the error, which is the
difference between the actual value and the predicted value. With this
constructive principle the NNs converge to a local minimum. On this
non-linear boundary of the activation function the NN learns possible
unusual data patterns.

The non-linear activation functions are curved boundaries which can
adapt to the non-linear changes of the output. The NN model can now
learn complex features of the input that are mapped on to the boundary.

For example, let us consider the linear activation function given by

$Y = F(X) = X$                                                  ...(i)

Consider the I/O mapping of the linear function given by equation (i).

Table 2.1 I/O Mapping of Linear Function
[For each input X from 0 to 10, the output Y is equal to X.]

Here, for the input from 0-10, the boundary of the output is from 0-10.
The activation function has been trained on the inputs from 0-10. The
activation function has adapted to learn the same outputs as that of
the inputs. Here the predictive powers of the NN are not so accurate as
the output is the same as the input. If X = 3 has some error, then the
NN fails to reduce this error as it will give the same output Y = 3.
This is the major drawback of the linear activation functions.

On the other hand, consider a non-linear activation function of the
sigmoid type given by equation (ii).

$F(x) = \frac{1}{1 + e^{-x}}$                                   ...(ii)


Consider the I/O mapping of the sigmoid function.

Table 2.2 I/O Mapping of the Non-linear Function
[Recoverable entries: F(x) is 0 for the most negative input and 0.5 at
x = 0, then takes the values 0.7, 0.8, 0.96, 0.97, 0.98, 0.99 and 1 as
x increases.]

Here, for the input given, the output varies between 0 and 1. While the
input is given data like 0, 1, etc., the model is adapting to learn the
new output like 0.5, 0.7, etc. For a given input boundary the model has
learned a new output boundary. This shows the non-linearity
characteristic of the sigmoid function. Though the input is linear, the
output is non-linear. Because of this non-linearity characteristic the
sigmoid function is majorly used in the NN.
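The two I/O mappings compared above can be reproduced with a few lines of Python. This is a small sketch using only equations (i) and (ii); the input range 0-10 follows the text, while the function and variable names and the rounding to two decimals are choices made for this example.

```python
import math

def linear(x):
    """Equation (i): the output is identical to the input."""
    return x

def sigmoid(x):
    """Equation (ii): squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

inputs = list(range(0, 11))                      # inputs 0, 1, ..., 10
linear_out = [linear(x) for x in inputs]
sigmoid_out = [round(sigmoid(x), 2) for x in inputs]

for x, y_lin, y_sig in zip(inputs, linear_out, sigmoid_out):
    print(x, y_lin, y_sig)
```

The linear column simply repeats the input, whereas the sigmoid column starts at 0.5 and flattens out close to 1, which is the non-linear output boundary the answer refers to.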


Q.3. Explain the linear and non-linear function.

Ans. Some linear and non-linear activation functions are as follows -

(i) Identity Function - This is a linear activation function, given by

$f(x) = x$ for all $x$

Fig. 2.2 Identity Function

(ii) Binary Step Function - This function is given by

$f(x) = \begin{cases} 1, & \text{if } x \ge \theta \\ 0, & \text{if } x < \theta \end{cases}$

where $\theta$ represents the threshold value. Mostly single layer nets
use this function for determining the output from the net input. It is
also known as threshold function or Heaviside function. This is shown
in fig. 2.3.

Fig. 2.3 Binary Step Function
(iii) Sigmoidal Functions - These are used in multilayer networks such
as back propagation, radial basis function network, etc. Usually these
functions are S-shaped curves. The logistic and hyperbolic tangent
functions are commonly used. The types of Sigmoidal functions are as
follows -

(a) Binary Sigmoidal Function - This is also called logistic function
or unipolar function. The range of this function is between 0 to 1. It
can be written as -

$f(x) = \frac{1}{1 + e^{-\sigma x}}$

where $\sigma$ represents the steepness parameter. On differentiating
f(x), we get

$f'(x) = \sigma f(x)[1 - f(x)]$

The binary Sigmoidal function is shown in fig. 2.4.

Fig. 2.4 Binary Sigmoidal Function

(b) Bipolar Sigmoidal Function - It ranges between +1 and -1. This
function is given as

$f(x) = \frac{2}{1 + e^{-\sigma x}} - 1 = \frac{1 - e^{-\sigma x}}{1 + e^{-\sigma x}}$

On differentiating the function, we get

$f'(x) = \frac{\sigma}{2}[1 + f(x)][1 - f(x)]$

Fig. 2.5 shows the bipolar Sigmoidal function, which is closely related
to the hyperbolic tangent function and can be defined as

$h(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$

On differentiating the function h(x), we get

$h'(x) = [1 + h(x)][1 - h(x)]$

Fig. 2.5 Bipolar Sigmoidal Function


(iv) Bipolar Step Function - This function is given by

$f(x) = \begin{cases} +1, & \text{if } x \ge \theta \\ -1, & \text{if } x < \theta \end{cases}$

This function is used in single layer nets to convert the net input to
an output that is bipolar (+1 or -1). This is shown in fig. 2.6.

Fig. 2.6 Bipolar Step Function

(v) Ramp Function - This function is

$f(x) = \begin{cases} 1, & \text{if } x > 1 \\ x, & \text{if } 0 \le x \le 1 \\ 0, & \text{if } x < 0 \end{cases}$

Fig. 2.7 shows the ramp function.

Fig. 2.7 Ramp Function
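The activation functions listed in this answer translate directly into code. The sketch below is illustrative only: the function names, the default values theta = 0 and sigma = 1, and the helper `numeric_derivative` are assumptions of this example. The helper is used to check the two derivative formulas numerically.

```python
import math

def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1

def binary_sigmoid(x, sigma=1.0):
    # Range (0, 1); sigma is the steepness parameter.
    return 1.0 / (1.0 + math.exp(-sigma * x))

def bipolar_sigmoid(x, sigma=1.0):
    # Range (-1, +1); equals tanh(x) when sigma = 2.
    return 2.0 / (1.0 + math.exp(-sigma * x)) - 1.0

def ramp(x):
    if x > 1:
        return 1.0
    return x if x >= 0 else 0.0

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x), used only for checking."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, sigma = 0.3, 2.0
f = binary_sigmoid(x, sigma)
g = bipolar_sigmoid(x, sigma)
print(numeric_derivative(lambda t: binary_sigmoid(t, sigma), x),
      sigma * f * (1 - f))                 # the two values should agree
print(numeric_derivative(lambda t: bipolar_sigmoid(t, sigma), x),
      (sigma / 2) * (1 + g) * (1 - g))     # the two values should agree
```

With sigma = 2 the bipolar sigmoid coincides with the hyperbolic tangent h(x), which is the close relation mentioned under item (b).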
Q.4. Explain in detail activation function.
Or
Write short note on activation function.        (R.G.P.V., June 2009)

Ans. The activation function is used to calculate the output of an
artificial neural network (ANN). An activation is applied to the sum of
the weighted input signals to obtain the response. The same activation
function is used for all neurons in the same layer. There may be linear
as well as non-linear activation functions. The information processing
of a processing element is made up of two main parts - input and
output. A function called the integration function is associated with
the input of a processing element; it combines the information,
evidence and activation from other sources and other processing
elements into a net input for the processing element. To guarantee that
the output is bounded, we use a non-linear activation function. The
non-linear activation functions are also used to obtain the benefits of
a multilayer network over a single-layer network. Non-linear functions
are used in multilayer networks instead of linear functions because, if
a signal is fed through a multilayer network with linear activation
functions, the obtained output is the same as that obtained using a
single-layer network.
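The last claim, that stacking layers with linear activations gives nothing beyond a single layer, can be checked numerically. The sketch below uses made-up weights and a made-up input; the helper names `matvec`, `matmul` and `add` are assumptions of this example.

```python
def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two layers with linear (identity) activation: 2 inputs -> 3 hidden -> 2 outputs.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
b2 = [0.5, -0.5]
x = [1.0, -2.0]

two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same mapping as ONE layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)
```

Both print the same vector, so the hidden layer adds no representational power until a non-linear activation is placed between the two layers.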

In a linear model, a linear mapping of an input function to an output,
as performed in the hidden layers before the final prediction of the
class score for each label, is given by the affine transformation in
most cases. The transformation of the input vectors x is given by

$f(x) = w^{T} x + b$                                            ...(i)

where x = Input, w = Weights, b = Biases.

Furthermore, the neural networks produce linear results from the
mappings from equation (i), and the need for the activation function
arises, first to convert these linear outputs into non-linear output
for further computation, especially to learn patterns in data. The
outputs of these models are given by

$y = (w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$                 ...(ii)

The outputs of each layer are fed into the next subsequent layer for
multilayered networks like deep neural networks until the final output
is obtained, but they are linear by default. The expected output
determines the type of activation function to be deployed in a given
network.


However, since the outputs are linear in nature, the non-linear
activation functions are required to convert these linear inputs to
non-linear outputs. These AFs are transfer functions that are applied
to the outputs of the linear models to produce the transformed
non-linear outputs, ready for further processing. The non-linear output
after the application of the activation function is given by

$y = \alpha(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$           ...(iii)

where $\alpha$ is the activation function.

These activation functions are needed to convert the linear input
signals and models into non-linear output signals, which aids the
learning of high order polynomials beyond one degree for deeper
networks. A special property of the non-linear activation functions is
that they are differentiable, else they cannot work during back
propagation of the deep neural networks.
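Equations (ii) and (iii) differ only in whether an activation is applied to the weighted sum. A minimal sketch of a single neuron follows; the input, weights and bias are arbitrary example values and the name `neuron` is an assumption of this example.

```python
import math

def neuron(x, w, b, activation):
    """Compute activation(w1*x1 + ... + wn*xn + b) for one unit."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear part, equation (ii)
    return activation(z)                           # non-linear part, equation (iii)

x = [0.5, -1.0, 2.0]      # example inputs
w = [0.4, 0.3, -0.2]      # example weights
b = 0.1                   # example bias

linear_out = neuron(x, w, b, lambda z: z)
sigmoid_out = neuron(x, w, b, lambda z: 1.0 / (1.0 + math.exp(-z)))
print(linear_out, sigmoid_out)
```

The linear output is about -0.4, while the sigmoid maps the same net input into the bounded interval (0, 1).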


Q.5. Why do we use threshold function/activation function ?
(R:GPV, Dec. 208 
Or
What is the role of activation function in artificial neural network ?
[R-GP.V, June 20160 
Ans. The role of the activation function can be understood as follows.
Suppose a person is carrying out some work. Some force or activation
may be given to make the work more efficient and to get the exact
output. This activation aids in obtaining the exact output. In a
similar manner, the activation function is applied over the net input
to determine the ANN output.


Q.6. Compare merits and demerits of activation functions.

Ans. The step function is very simple to implement. The output of the
ramp function is saturated like that of the step function. But unlike
the step function, the ramp is continuous. Small variations in net
weighted input cause correspondingly small variations in the output.
This desirable property is gained at the loss of the simple ON/OFF
description of the output. Sigmoid functions are continuous. The merit
of these functions is that their smoothness makes it simple to devise
learning algorithms and understand the nature of large networks whose
nodes calculate such functions. When plotted against the net input to
a neuron, experimental observations of biological neurons demonstrate
that the neuronal firing rate is roughly sigmoidal. But the Brooklyn
Bridge can be sold simply to anyone who believes that biological
neurons perform any precise mathematical operation like exponentiation.
From the software or hardware implementation viewpoint, exponentiation
is an expensive computational process, and one may question whether
such extensive calculations make a real difference for practical neural
networks.
Q.7. What do you mean by Sigmoid function ? Also describe its advantages
and disadvantages.

Ans. The Sigmoid is a non-linear activation function used mostly in
feedforward neural networks. It is a bounded differentiable real
function, defined for real input values, with positive derivatives
everywhere and some degree of smoothness. The Sigmoid function is given
by the relationship -

$f(x) = \frac{1}{1 + e^{-x}}$

The Sigmoid function appears in the output layers of the deep learning
architectures, where it is used for predicting probability based output,
and it has been applied successfully in binary classification problems,
modeling logistic regression tasks as well as other neural network
domains. It has seen frequent use historically since it has a nice
interpretation as the firing rate of a neuron, from not firing at all to
fully-saturated firing at an assumed maximum frequency. The main
advantages of the Sigmoid functions are that they are easy to understand
and are used mostly in shallow networks. Another point is that the
Sigmoid activation function should be avoided when initializing the
neural network from small random weights.


In practice, the Sigmoid activation function has recently fallen out of
favour and is rarely used since it has two major drawbacks.

A very undesirable property of the Sigmoid function is that its
activations saturate at either tail of zero or one and the gradient at
these regions is very close to zero. During back propagation, the local
gradient will be multiplied by the gradient of this layer's output for
the overall loss. Therefore, if the local gradient is very small, it
will effectively diminish the gradient and almost no signal will flow
through the unit to its weights and recursively to its data. This is
the vanishing gradient problem.

Another undesirable property of the sigmoid activation function is that
its output is non-zero centred. This has implications on the dynamics
during gradient descent: since the data flowing into a neuron is always
positive, the gradient on the weights during back propagation will be
either all-positive or all-negative, depending on the gradient of the
whole expression. This could introduce undesirable zig-zagging dynamics
in the updates of the weights.
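Both drawbacks can be seen numerically from the derivative f'(x) = f(x)[1 - f(x)]. The sketch below is illustrative; the sample points and the ten-layer chain are arbitrary choices made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Local gradient at the tails versus at the centre.
grads = {x: sigmoid_grad(x) for x in (-10, -5, 0, 5, 10)}

# Even in the best case (0.25 at x = 0) the product of local gradients
# shrinks quickly when it is chained through many sigmoid layers.
chain = 1.0
for _ in range(10):
    chain *= sigmoid_grad(0.0)

# Outputs are never negative, i.e. they are not zero-centred.
outputs = [sigmoid(x) for x in (-10, -1, 0, 1, 10)]
print(grads, chain, outputs)
```

The gradient peaks at 0.25 and is below 0.0001 at x = 10 or x = -10, so saturated units pass almost no signal back to their weights.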
Q.8. Write short notes on the following -

(i) Hard Sigmoid function

(ii) Sigmoid-weighted linear units (SiLU).

Ans. (i) Hard Sigmoid Function - The hard Sigmoid activation is another
variant of the Sigmoid activation function and this function is given by

$f(x) = \mathrm{clip}\left(\frac{x + 1}{2}, 0, 1\right)$        ...(i)

The equation (i) can be re-written in the form

$f(x) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right)$  ...(ii)

A comparison of the hard Sigmoid with the soft Sigmoid shows that the
hard Sigmoid offers lesser computation cost when implemented both in a
specialized hardware or software form, and the authors highlighted that
it showed some promising results on deep learning based binary
classification tasks.
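Equation (ii) of the hard Sigmoid needs only a comparison and a division by two, which is where the lower computation cost comes from. A small sketch follows; the function names and the sample points are choices made for this example.

```python
import math

def hard_sigmoid(x):
    """Piecewise-linear approximation: max(0, min(1, (x + 1) / 2))."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))

def sigmoid(x):
    """Soft Sigmoid, shown only for comparison."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(x, hard_sigmoid(x), round(sigmoid(x), 3))
```

Both functions return 0.5 at x = 0, but the hard version reaches exactly 0 and 1 outside the interval [-1, 1] without evaluating an exponential.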


(ii) Sigmoid-weighted Linear Units (SiLU) - The Sigmoid-weighted
linear unit is a reinforcement learning based approximation function.
The SiLU function is computed as Sigmoid multiplied by its input. The
AF $a_k$ of a SiLU is given by

$a_k(s) = z_k \, \sigma(z_k)$                                   ...(iii)

where s = Input vector, $z_k$ = Input to hidden unit k, and $\sigma$ is
the Sigmoid function. The input to the hidden layers is given by

$z_k = \sum_i w_{ik} s_i + b_k$                                 ...(iv)

In equation (iv), $b_k$ is the bias and $w_{ik}$ is the weight
connecting to the hidden unit k.

As originally proposed, the SiLU function was used only in the hidden
layers of the deep neural networks and only for reinforcement learning
based systems.

Q.9. What do you mean by ReLU ? Also describe advantages and
disadvantages of it.

Ans. The rectified linear unit (ReLU) is another type of activation
function used in neural network computing. This activation function was
proposed by Nair and Hinton in 2010, and ever since has been the most
widely used activation function for deep learning applications, with
state-of-the-art results to date. The ReLU is a faster learning
activation function, which has proved to be the most successful and
widely used function. It offers better performance and generalization
in deep learning compared to the Sigmoid and tanh activation functions.
The ReLU represents a nearly linear function and therefore preserves
the properties of linear models that made them easy to optimize with
gradient-descent methods.
The ReLU activation function performs a threshold operation on each
input element, where values less than zero are set to zero. Thus the
ReLU is given by

$f(x_i) = \max(0, x_i) = \begin{cases} x_i, & \text{if } x_i \ge 0 \\ 0, & \text{if } x_i < 0 \end{cases}$


This function rectifies the values of the inputs less than zero, thereby
forcing them to zero and alleviating the vanishing gradient problem
observed in the earlier types of activation function. The ReLU function
has been used within the hidden units of the deep neural networks, with
another activation function used in the output layers of the network.
Typical examples are found in object classification and speech
recognition applications.


The main advantage of using the rectified linear units in computation
is that they guarantee faster computation since they do not compute
exponentials and divisions, with overall speed of computation enhanced.
Another property of the ReLU is that it introduces sparsity in the
hidden units as it squishes the values between zero to maximum.
However, the ReLU has a limitation that it easily overfits compared to
the Sigmoid function, although some technique has been adopted to
reduce the effect of overfitting of ReLUs and the rectified networks
improved performances of the deep neural networks.

The ReLU has a significant limitation that it is sometimes fragile
during training, causing some of the gradients to die. This leads to
some neurons being dead as well, thereby causing the weight updates not
to activate in future data points, thereby hindering learning as dead
neurons give zero activation.

Q.10. Define the following terms -
(i) Leaky ReLU (LReLU)
(ii) Parametric ReLU (PReLU).


Ans. (i) Leaky ReLU (LReLU) - The leaky rectified linear unit (LReLU)
was proposed in 2013 as an activation function that introduces some
small negative slope to the ReLU to sustain and keep the weight updates
alive during the entire propagation process. The alpha parameter was
introduced as a solution to the ReLU's dead neuron problem such that
the gradients will not be zero at any time during training. The LReLU
computes the gradient with a very small constant value $\alpha$ for the
negative part, in the range of 0.01. Thus the LReLU AF is computed as -

$f(x) = \max(\alpha x, x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \le 0 \end{cases}$

The LReLU has an identical result when compared to the standard ReLU,
with the exception that it has non-zero gradients over the entire
duration. This suggests that there is no significant result improvement
except in sparsity and dispersion when compared to the standard ReLU
and tanh functions. The LReLU was tested on an automatic speech
recognition dataset.


(ii) Parametric ReLU (PReLU) - The parametric ReLU, known as PReLU, is
another variant of the ReLU activation function proposed in 2015. The
PReLU has the negative part of the function being adaptively learned
while the positive part is linear. The PReLU is given by

$f(x_i) = \begin{cases} x_i, & \text{if } x_i > 0 \\ a_i x_i, & \text{if } x_i \le 0 \end{cases}$

where $a_i$ is the negative slope controlling parameter, and it is
learnable during training with back propagation. If the term
$a_i = 0$, the PReLU becomes ReLU.

The PReLU can be written in compact form as

$f(x_i) = \max(0, x_i) + a_i \min(0, x_i)$

The performance of PReLU was reported to be better than ReLU in large
scale image recognition, and these results from the PReLU were the
first to surpass human-level performance on a visual recognition
challenge.
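The ReLU, LReLU and PReLU formulas above differ only in how the negative inputs are treated. The sketch below is illustrative; the function names, the default alpha = 0.01 and the test points are assumptions of this example, and in a real PReLU the slope a would be learned rather than fixed.

```python
def relu(x):
    return x if x >= 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha on the negative side.
    return x if x > 0 else alpha * x

def prelu(x, a):
    # Same shape as LReLU, but the slope a is a trainable parameter.
    return x if x > 0 else a * x

def prelu_compact(x, a):
    # Compact form: max(0, x) + a * min(0, x).
    return max(0.0, x) + a * min(0.0, x)

points = [-2.0, -0.5, 0.0, 0.5, 2.0]
for x in points:
    print(x, relu(x), leaky_relu(x), prelu(x, 0.25), prelu_compact(x, 0.25))
```

Setting a = 0 in `prelu` reproduces `relu` on every point, which matches the remark that the PReLU becomes the ReLU in that case.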


Q.11. Explain the exponential linear unit (ELU) in detail.

Ans. The exponential linear unit (ELU) is another type of AF proposed
in 2015, and it is used to speed up the training of deep neural
networks. The main advantage of the ELUs is that they can alleviate the
vanishing gradient problem by using identity for positive values, and
they also improve the learning characteristics. They have negative
values, which allows for pushing of the mean unit activation closer to
zero, thereby reducing computational complexity and improving learning
speed. The ELU represents a good alternative to the ReLU as it
decreases bias shifts by pushing mean activation towards zero during
training.

The exponential linear unit (ELU) is given by

$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha(\exp(x) - 1), & \text{if } x \le 0 \end{cases}$

The derivative or gradient of the ELU equation is given as

$f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ f(x) + \alpha, & \text{if } x \le 0 \end{cases}$

where $\alpha$ = ELU hyperparameter that controls the saturation point
for negative net inputs, which is usually set to 1.0.

The ELUs have a clear saturation plateau in their negative regime,
thereby learning more robust representations, and they have been
reported to offer faster learning and better generalization compared to
the ReLU and LReLU with specific network structures, especially above
five layers, with state-of-the-art results compared to ReLU variants.
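The ELU and its gradient can be written down directly from the two piecewise formulas. This is a minimal sketch; the function names and the test points are choices made for this example, and the form alpha * (exp(x) - 1) is used for the negative branch so that the gradient f(x) + alpha is consistent with it.

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive inputs, smooth saturation towards -alpha otherwise.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    # For x <= 0: d/dx alpha*(exp(x) - 1) = alpha*exp(x) = elu(x) + alpha.
    return 1.0 if x > 0 else elu(x, alpha) + alpha

for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```

For very negative inputs the output flattens at -alpha (the saturation plateau) and the gradient tends to zero, while positive inputs pass through unchanged with gradient 1.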


Q.12. Define the term hyperbolic tangent function (tanh).

Ans. The hyperbolic tangent function is another type of activation
function used in deep learning and it has some variants used in deep
learning applications. The hyperbolic tangent function, known as tanh
function, is a smoother zero-centred function whose range lies between
-1 to 1. Thus the output of the tanh function is given by

$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

The tanh function became the preferred function compared to the Sigmoid
function in that it gives better training performance for multi-layer
neural networks. However, the tanh function could not solve the
vanishing gradient problem suffered by the Sigmoid functions as well.
The main advantage provided by the function is that it produces zero
centred output, thereby aiding the back propagation process.

A property of the tanh function is that it can only attain a gradient
of 1 when the value of the input is 0, that is when x is zero. This
makes the tanh function produce some dead neurons during computation.
The dead neuron is a condition where the activation weight is rarely
used as a result of zero gradient. This limitation of the tanh function
spurred further research in activation functions to resolve the
problem, and it birthed the rectified linear unit (ReLU) activation
function.

The tanh functions have been used mostly in recurrent neural networks
for natural language processing and speech recognition tasks.
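The properties stated above (zero-centred output, gradient of 1 only at x = 0, saturation for large inputs) can be checked with a short sketch. The names are choices made for this example; the gradient formula 1 - tanh(x)^2 is the standard derivative of tanh.

```python
import math

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, which is 1 only at x = 0.
    return 1.0 - tanh(x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    # tanh is a rescaled Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
    print(x, round(tanh(x), 4), round(2 * sigmoid(2 * x) - 1, 4),
          round(tanh_grad(x), 4))
```

The outputs are symmetric around zero, and the gradient drops below 0.001 at x = 5 or x = -5, which is the saturation behaviour that tanh shares with the Sigmoid.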
Q.13. Write short notes on the following -
(i) Softmax function
(ii) Softsign.

Ans. (i) Softmax Function - The Softmax function is another type of
activation function used in neural computing. It is used to compute a
probability distribution from a vector of real numbers. The Softmax
function produces an output which is a range of values between 0 and 1,
with the sum of the probabilities being equal to 1. The Softmax
function is computed using the relationship

$f(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

The Softmax function is used in multi-class models, where it returns
probabilities of each class, with the target class having the highest
probability. The Softmax function appears in the output layers of
almost all deep learning architectures where it is used.

The main difference between the Sigmoid and Softmax activation
functions is that the Sigmoid is used in binary classification while
the Softmax is used for multivariate classification tasks.
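The Softmax relationship can be sketched in a few lines. The example scores are made up, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result; it is an addition of this example rather than part of the formula above.

```python
import math

def softmax(scores):
    """Return exp(x_i) / sum_j exp(x_j) for every score x_i."""
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # example class scores
print(probs, sum(probs))

# Without the shift, exp(1000.0) would overflow; with it the result is fine.
print(softmax([1000.0, 1000.0]))
```

The largest score receives the largest probability and the outputs sum to 1, as stated in the answer.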

(ii) Softsign - The Softsign is another type of activation function
that is used in neural network computing. The Softsign function was
proposed in 2009. The Softsign is another non-linear activation
function used in deep learning applications.

The Softsign function is a rational function of its input, given by

$f(x) = \frac{x}{|x| + 1}$

where |x| = Absolute value of the input.

The main difference between the Softsign function and the tanh function
is that the Softsign converges in polynomial form, unlike the tanh
function which converges exponentially.

The Softsign has been used mostly in regression computation problems
but has also been applied to deep learning based text to speech
systems, with the authors reporting some promising results using the
Softsign function.
Q.14. Write short note on weight and biases.

Ans. An artificial neural network model is a data-driven model aiming
to mimic the systematic relationship between input and output data by
training the network based on a large amount of data. It is composed of
information processing units called neurons, which are fully connected
with different weights indicating the strength of the relationships
between input and output data. Biases are also necessary to increase or
decrease the net input of the neurons. Without training, the network
cannot accurately estimate the required output. Therefore, the weights
and biases are continuously modified by the so-called training so that
the difference between the model output and the target (observed) value
becomes small. To train the network, the error function is defined as
the sum of the squares of the differences. To minimize the error
function, the BP training approach generally uses a gradient descent
algorithm. However, it can give a local minimum value of the error
function and it is sensitive to the initial weights and biases. In
other words, the gradient descent method is prone to giving a local
minimum or maximum value. If the initial weights and biases are
fortunately selected to be close to the values that give the global
minimum of the error function, the global minimum would be found by the
gradient method. As a consequence of local minimization, ANNs may
provide erroneous results.

To find the optimal initial weights and biases that lead into the
global minimum of the error function, a Monte-Carlo simulation is often
used, which takes a long computation time. Moreover, even if we find
the optimal initial weights and biases by the simulation, they cannot
be reproduced by the general users of the ANN model.


Q.15. Explain loss function and its types.

Ans. Loss function is an important part in artificial neural networks,
which is used to measure the inconsistency between the predicted value
($\hat{y}$) and the actual label (y). It is a non-negative value, where
the robustness of the model increases along with the decrease of the
value of the loss function. Loss function is the hard core of the
empirical risk function as well as a significant component of the
structural risk function. Generally, the structural risk function of a
model consists of an empirical risk term and a regularization term,
which can be represented as -

$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta) + \lambda \cdot \Phi(\theta)$

$= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot \Phi(\theta)$

$= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f(x^{(i)}, \theta)) + \lambda \cdot \Phi(\theta)$

where $\Phi(\theta)$ is the regularization term or penalty term,
$\theta$ is the parameters of the model to be learned, $f(\cdot)$
represents the activation function and
$x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \dots, x_m^{(i)}\} \in \mathbb{R}^{m}$
denotes a training sample.

Here we only concentrate on the empirical risk term (loss function)

$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f(x^{(i)}, \theta))$

and introduce the mathematical expressions of several commonly-used
loss functions.


Mean Squared Error - Mean squared error (MSE), or quadratic, loss
function is widely used in linear regression as the performance measure,
and the method of minimizing MSE is called ordinary least squares. The
basic principle of ordinary least squares is that the optimized fitting
line should be a line which minimizes the sum of distance of each point
to the regression line, i.e., minimizes the quadratic sum. The standard
form of the MSE loss function is defined as

$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^{2}$

where $(y^{(i)} - \hat{y}^{(i)})$ is named as residual, and the target
of the MSE loss function is to minimize the residual sum of squares.
However, if using Sigmoid as the activation function, the quadratic
loss function would suffer the problem of slow convergence (learning
speed); for other activation functions, it would not have such problem.

For example, by using Sigmoid, we obtain
$\hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(\theta^{T} x^{(i)})$. For
simplicity, we consider only one sample, with the conventional factor
of 1/2, i.e., $\mathcal{L} = \frac{1}{2}(y - \sigma(z))^{2}$, and its
derivative is computed by

$\frac{\partial \mathcal{L}}{\partial \theta} = -(y - \sigma(z)) \, \sigma'(z) \, x$
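The derivative above also explains the slow convergence: the factor sigma'(z) becomes very small when the unit is saturated, even if the prediction is badly wrong. The sketch below is illustrative; the function names and the numbers are made up for this example, and the single-sample loss uses the factor 1/2 so that it matches the derivative formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def half_squared_error(theta, x, y):
    z = sum(t * xi for t, xi in zip(theta, x))
    return 0.5 * (y - sigmoid(z)) ** 2

def grad_single(theta, x, y):
    """dL/dtheta = -(y - sigma(z)) * sigma'(z) * x for one sample."""
    z = sum(t * xi for t, xi in zip(theta, x))
    s = sigmoid(z)
    return [-(y - s) * s * (1.0 - s) * xi for xi in x]

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

x, y = [1.0, 2.0], 0.0
mild = grad_single([0.1, 0.1], x, y)         # z = 0.3, unit not saturated
saturated = grad_single([4.0, 3.0], x, y)    # z = 10, prediction close to 1
print(mild, saturated)
```

In the saturated case the prediction is almost as wrong as it can be, yet the gradient is several orders of magnitude smaller than in the mild case, so learning is slow.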
Q.16. Explain the following terms -
(i) Hinge loss                    (ii) Cross entropy loss.

Ans. (i) Hinge Loss - A commonly used loss function for image
classification in neural networks is the multi-class hinge loss, also
referred to as support vector machine (SVM) loss. It requires the
correct class to receive a higher score than the incorrect classes.
With $s_j$ denoting the score of class j, the multi-class hinge loss is
given by

$L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \delta)$

where $L_i$ denotes the loss for the i-th training example and $\delta$
is the fixed margin.

We can see that the hinge loss prefers the score of the correct class
$y_i$ to be larger than the incorrect class scores by at least a margin
$\delta$. If this is not the case, loss is accumulated. Another related
loss function is the squared hinge loss, which penalizes violated
margins quadratically instead of linearly, and thus more strongly. The
squared hinge loss can be formalized as -

$L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \delta)^{2}$

There are various alternatives of the multi-class hinge loss such as
the one-vs-all formulation, which trains an independent binary SVM for
each class, or the structured SVM, which maximizes the margin between
the score of the correct class and the score of the highest-scoring
incorrect runner-up class.
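The two hinge loss formulas can be evaluated for one training example as follows. This is a minimal sketch; the class scores, the margin delta = 1 and the function name are assumptions of this example.

```python
def hinge_loss(scores, correct, delta=1.0, squared=False):
    """Multi-class hinge loss for one example.

    scores  : list of class scores s_j
    correct : index y_i of the correct class
    """
    loss = 0.0
    for j, s in enumerate(scores):
        if j == correct:
            continue                          # the sum skips j = y_i
        margin = max(0.0, s - scores[correct] + delta)
        loss += margin ** 2 if squared else margin
    return loss

scores = [3.0, 5.0, -1.0]                     # example scores for 3 classes

print(hinge_loss(scores, correct=0))                 # class 1 violates the margin
print(hinge_loss(scores, correct=0, squared=True))   # same violation, squared
print(hinge_loss(scores, correct=1))                 # margin satisfied, no loss
```

When class 0 is correct, class 1 beats it by 2 and the margin adds 1, giving a loss of 3 (or 9 when squared); when class 1 is correct, every margin is satisfied and the loss is 0.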


(ii) Cross Entropy Loss - The most popular loss function for image
classification in neural networks is the cross entropy loss, generalized
to multiple classes via the softmax function and the negative log
likelihood. Mathematically, the cross entropy loss has the form

$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right) = -s_{y_i} + \log \sum_j e^{s_j}$

where $L_i$ denotes the loss for the i-th training example.

Both hinge and cross entropy loss usually result in comparable
classification performance. However, unlike the hinge loss, which
treats outputs as uncalibrated - and possibly difficult to interpret -
scores for each class, the cross entropy loss results in normalized
class probabilities. The cross entropy loss has a rigorous
interpretation in the domains of probability and information theory,
which is why it is the preferred choice of loss function for
classification tasks in practice.
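The second form of the cross entropy loss, -s_{y_i} + log sum_j exp(s_j), is convenient to compute directly. The sketch below uses made-up scores; shifting by the maximum score is a standard numerical-stability step added for this example and does not change the value.

```python
import math

def cross_entropy(scores, correct):
    """Return -s_correct + log(sum_j exp(s_j)) for one example."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -scores[correct] + log_sum_exp

scores = [3.0, 5.0, -1.0]                     # example class scores

# The same loss through the first form: -log of the softmax probability.
p_correct = math.exp(scores[0]) / sum(math.exp(s) for s in scores)
print(cross_entropy(scores, 0), -math.log(p_correct))

# Two equal scores give probability 1/2, hence a loss of log 2.
print(cross_entropy([0.0, 0.0], 0), math.log(2.0))
```

Both forms agree, and the loss shrinks towards zero as the score of the correct class grows relative to the others.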


(a) Probabilistic Interpretation - In the probabilistic
interpretation, the negative log likelihood of the correct class is
minimized, which can be seen as performing maximum likelihood
estimation. If we take $L_2$ weight regularization into account, we can
interpret the loss function as having a Gaussian prior on the weights
and we are thus performing maximum a posteriori (MAP) estimation. The
cross entropy loss can be interpreted through the normalized
probability assigned to the correct class label given the input -

$P(y_i \mid x_i; \theta) = \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$

where $x_i$ is an input with the corresponding class label $y_i$,
parameterized by the model parameters $\theta$.


(b) Information Theory Interpretation - In the information theory interpretation, the cross entropy between a true distribution p and an estimated distribution q is defined as

$$\underbrace{H(p, q)}_{\text{cross entropy}} = -\sum_x p(x)\log q(x) = \underbrace{H(p)}_{\text{entropy}} + \underbrace{D_{KL}(p \,\|\, q)}_{\text{KL divergence}}$$

The cross entropy loss is therefore minimizing the cross entropy between the estimated class probabilities and the true distribution, which in this interpretation is the distribution where all probability mass is on the correct class, i.e. a vector of zeros with a single one at the y_i-th position. In the literature this is referred to as one-hot encoding. Moreover, since the cross entropy can be written in terms of entropy and the Kullback-Leibler divergence, and the entropy H(p) does not depend on the model, it is equivalent to minimizing the KL divergence - a difference measure - between the two distributions. Thus, the cross entropy loss wants the predicted distribution to have all of its mass at the correct answer. It has to be noted that the KL divergence is often used as a distance measure in practice but does not correspond to a true metric, since it does not obey the triangle inequality and in general D_KL(p ‖ q) does not equal D_KL(q ‖ p).
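
The decomposition of the cross entropy and the asymmetry of the KL divergence can be verified on a small example. This is a sketch for discrete distributions; the probability values below are made up purely for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]        # "true" distribution (illustrative values)
q = [0.5, 0.3, 0.2]        # estimated distribution (illustrative values)
one_hot = [0.0, 1.0, 0.0]  # all probability mass on the correct class
```

For a one-hot target the entropy term is zero, so the cross entropy reduces to the negative log probability that q assigns to the correct class.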

Q.17. Explain in detail about the gradient descent.

Ans. The gradient descent approach is a first-order optimization algorithm and a method for finding a local minimum of an objective function. It has been used successfully for training neural networks over the last couple of decades.

Once the gradients of the loss with respect to the parameters are computed by back propagation, they are used to perform a gradient descent parameter update. There are three popular variants of gradient descent, which differ in how much data we use to compute the gradient of the loss function with respect to the network parameters. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Input  - input x, labels y, parameters θ, learning rate η
Output - trained model

    e ← 0                                    # current epoch
    repeat
        l ← 0                                # total loss
        for x_i, y_i ∈ {X, Y} do
            ŷ_i ← forward(x_i, θ)            # forward pass
            l_i ← loss(ŷ_i, y_i)             # loss calculation
            l ← l + l_i                      # update total loss
            ∇_θ l_i ← backward(x_i, y_i, θ)  # backward pass
            θ ← θ − η · ∇_θ l_i              # gradient descent
        end
        l ← l / N                            # estimate epoch loss (N training examples)
        e ← e + 1
    until convergence

The aim of the parameter update is to change the parameters along the negative gradient direction. Since the gradient indicates the direction of increase, it is negated such that the loss function is minimized and not maximized. Mathematically, all three gradient descent updates have the following form

$$\theta \leftarrow \theta - \eta\,\nabla_\theta L$$

where η denotes the learning rate, a hyperparameter that controls the step size of a single update.

(i) Batch Gradient Descent - Batch gradient descent computes the gradient of the loss function for the entire training dataset x and labels y. As the gradients have to be calculated for the whole dataset to perform a single parameter update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory. Batch gradient descent also does not allow the model to be updated online, i.e. with new examples on the fly.

(ii) Stochastic Gradient Descent - In contrast, stochastic gradient descent (SGD) performs a parameter update for each training example x_i and label y_i. Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update; SGD avoids this redundancy by performing one update at a time. Its frequent updates have a high variance and can cause the loss function to fluctuate heavily. The convergence of stochastic gradient descent has been analyzed using the theories of convex optimization and stochastic approximation: with a suitably decreasing learning rate and under mild assumptions, it converges almost surely to a global minimum for convex objectives and to a local minimum otherwise.

(iii) Mini-batch Stochastic Gradient Descent - Mini-batch stochastic gradient descent finally takes the best of both worlds and performs an update for every mini-batch of training examples x_B and its corresponding labels y_B. The hyperparameter B denotes the batch size, i.e. how many training examples are processed simultaneously to perform a single parameter update. Firstly, mini-batch SGD reduces the variance of the parameter updates, which can lead to more stable convergence. Secondly, we can make use of highly optimized and parallelized matrix operations, which are frequently used in modern deep learning frameworks. Mini-batch gradient descent is the de-facto standard algorithm for training deep neural networks, and the term SGD is usually employed also when mini-batches are used, which can sometimes lead to confusion.
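
The three variants differ only in how many examples contribute to each gradient, which a single batch-size parameter can express. The sketch below is an illustration under stated assumptions: the model is a one-variable linear fit with a mean squared error loss, and the toy data, learning rate and epoch count are all hypothetical choices, not values from the text.

```python
import random

def minibatch_gradient_descent(xs, ys, w, b, lr=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y = w*x + b by minimizing the mean squared error.

    batch_size = len(xs) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent,
    anything in between is mini-batch SGD.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new example order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of the mean squared error over the current batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta <- theta - eta * gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
```

Calling the function with batch_size set to 1, 2 or 4 runs SGD, mini-batch SGD or batch gradient descent on the same data; on this noise-free toy problem all three approach w = 2 and b = 1.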


Q.18. Explain about multilayer neural networks.

Ans. A typical multilayer artificial neural network consists of an input layer, an output layer and hidden layers. Such networks are known as layered networks. Fig. 2.8 represents a three layered multilayer neural network, and fig. 2.9 represents a simple block diagram.

The activity of the input units represents the raw information fed into the network, and the activity of the neurons in the hidden layer is determined by the activity of the input units and the connecting weights. This simple neural structure is interesting because the hidden layers are free to create their own representation of the input. Over a single layer network, a multilayer neural network gives no increase in computational power unless there is a non-linear activation function between the layers. It is the non-linear activation function of each neuron that gives neural networks many of their capabilities, such as nonlinear functional approximation.

Fig. 2.8 Representation of Three Layered Multilayer Neural Network (Input Layer, Hidden Layer, Output Layer)

Fig. 2.9 Simple Block Diagram (Input Layer, Hidden Layer, Output Layer)

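
The role of the non-linear activation function can be illustrated with scalar layers. In this sketch the weights are arbitrary example values and tanh is an assumed choice of activation: two stacked linear layers fold into one linear layer, while the same weights with a non-linearity in between do not.

```python
import math

def linear_layer(w, b, x):
    return w * x + b

def two_linear_layers(x):
    # hidden = 2x + 1, output = -3*hidden + 4, with no activation in between
    return linear_layer(-3.0, 4.0, linear_layer(2.0, 1.0, x))

def equivalent_single_layer(x):
    # the same mapping folded into one layer: (-3*2)x + (-3*1 + 4) = -6x + 1
    return linear_layer(-6.0, 1.0, x)

def two_layers_with_tanh(x):
    # identical weights, but with a non-linear activation on the hidden unit
    return linear_layer(-3.0, 4.0, math.tanh(linear_layer(2.0, 1.0, x)))
```

An affine function f satisfies f(0) + f(2) = 2 f(1); the tanh network violates this identity, so no single linear layer can reproduce it.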
Q.19. Explain the neural network architectures. (RGPV, June 2014)

Ans. An artificial neural network is a data processing system consisting of a large number of highly interconnected processing elements (artificial neurons) in an architecture inspired by the human brain. An ANN is represented using a directed graph. On the basis of learning mechanism, there are several classes of neural network. In general, we may identify three fundamentally different classes of network architectures.

(i) Single Layer Feedforward Network - This type of network has two layers, namely the input layer and the output layer. In a single layer network, the input layer neurons receive the input signals and the output layer neurons receive the output signals. The synaptic links carrying the weights connect every input neuron to the output neurons but not vice-versa. This type of network is called feedforward in type or acyclic in nature. The network is named single layer because it is the output layer alone that performs computation. The input layer only sends the signals to the output layer, thus the name single layer feedforward network. An example of a single layer feedforward network is shown in fig. 2.10, where x_i represent input neurons, y_j represent output neurons and w_ij the weights.

Fig. 2.10 An Example of a Single Layer Feedforward Network (Input Neurons, Output Neurons)

(ii) Multilayer Feedforward Network - The multilayer feedforward network architecture, besides the input and output layers, also has one or more intermediary layers called hidden layers. A multilayer feedforward network with l input neurons, m neurons in the hidden layer, and n output neurons in the output layer is written as l-m-n. A multilayer feedforward network is shown in fig. 2.11, in which x_i are input neurons, y_j are hidden neurons, and z_k are output neurons.

Fig. 2.11 Multilayer Feedforward Network (Input Layer, Hidden Layer, Output Layer)

(iii) Recurrent Networks - A recurrent network distinguishes itself from a feedforward neural network in that it has at least one feedback loop. Thus, in these networks, there could exist one layer with feedback connections as illustrated in fig. 2.12. There could also be neurons with self-feedback links, which refers to a situation where the output of a neuron is fed back into its own input.

Fig. 2.12 Recurrent Network
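
The difference between the three architectures shows up directly in how an output is computed. The sketch below assumes tanh activations and list-of-lists weight matrices; both are illustrative choices, and all function names are hypothetical.

```python
import math

def single_layer_forward(x, weights, activation=math.tanh):
    """weights[j][i] connects input neuron i to output neuron j (no feedback)."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def multilayer_forward(x, hidden_weights, output_weights):
    """Input layer -> hidden layer -> output layer; signals only flow forward."""
    hidden = single_layer_forward(x, hidden_weights)
    return single_layer_forward(hidden, output_weights)

def recurrent_neuron(inputs, w_in, w_self, state=0.0):
    """One neuron with a self-feedback link: its previous output re-enters its input."""
    outputs = []
    for x in inputs:
        state = math.tanh(w_in * x + w_self * state)
        outputs.append(state)
    return outputs
```

A feedforward network gives the same output whenever it sees the same input, whereas the recurrent neuron's response to a repeated input depends on its previous output as soon as w_self is non-zero.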


Q.20. What is back propagation ?

Ans. One of the most important developments in neural networks is the back propagation learning algorithm. The error back propagation algorithm is used for training multilayer perceptrons (multilayer feedforward neural networks) so that the network can be trained to capture the mapping implicit in the given set of input-output pattern pairs. The back propagation algorithm is a generalization of the least mean squared algorithm that modifies network weights to minimize the mean squared error between the desired and actual outputs of the network. The approach to be followed is basically a gradient descent along the error surface to arrive at the optimum set of weights. The error is defined as the squared difference between the desired output and the actual output obtained at the output layer of the network due to the application of an input pattern from the given input-output pattern pair. Back propagation uses supervised learning, in which the network is trained using data for which the inputs as well as the desired outputs are known. Once trained, the network weights are frozen and can be used to compute output values for new input samples.

The feedforward process involves presenting an input pattern to the input layer neurons, which pass the input values on to the first hidden layer. Each of the hidden layer nodes computes a weighted sum of its inputs, passes the sum through its activation function and presents the result to the output layer.

The back propagation algorithm assumes a feedforward neural network architecture. In this architecture nodes are partitioned into layers numbered 0 to L, where the layer number indicates the distance of a node from the input nodes. The lowermost layer is the input layer, numbered as layer 0, and the topmost layer is the output layer, numbered as layer L. Back propagation addresses networks for which L ≥ 2, containing "hidden layers" numbered 1 to L-1.
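
The gradient computation at the heart of back propagation can be written out for a small network with L = 2. This is a minimal sketch under assumptions not stated in the text: one hidden layer of sigmoid units, a single linear output unit, and the error E = 0.5 (y - t)^2 for one training pattern; all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: returns hidden activations h and the scalar output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, y

def loss(x, t, W1, b1, W2, b2):
    _, y = forward(x, W1, b1, W2, b2)
    return 0.5 * (y - t) ** 2

def backward(x, t, W1, b1, W2, b2):
    """Gradients of the squared error w.r.t. all weights, via the chain rule."""
    h, y = forward(x, W1, b1, W2, b2)
    delta_out = y - t                      # error signal at the output unit
    gW2 = [delta_out * hj for hj in h]
    gb2 = delta_out
    # error signal propagated back to each hidden unit
    delta_h = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(W2, h)]
    gW1 = [[d * xi for xi in x] for d in delta_h]
    gb1 = delta_h
    return gW1, gb1, gW2, gb2
```

A gradient descent step then subtracts the learning rate times each gradient from the corresponding weight. Comparing the returned values against finite differences of `loss` is a common way to check such an implementation.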

Q.21. What are the advantages, disadvantages and applications of back propagation network ?

Ans. Advantages - Following are the advantages of back propagation network -

(i) The computing time is minimized if the weights chosen are small at the beginning.

(ii) The mathematical formula of back propagation can be applied to any network.

Disadvantages - Following are the disadvantages of back propagation network -

(i) It needs a large number of learning steps, and the learning phase involves intensive calculations.

(ii) The training may sometimes cause temporal instability in the system.

(iii) The network may get trapped in a local minimum.

Applications - Back propagation has been used in a wide variety of applications, some of them are as follows -

(i) Data compression

(ii) Image compression

(iii) Face recognition

(iv) Optical character recognition

(v) Control problems

(vi) Non-linear simulation

(vii) Fault detection problems

(viii) Load forecasting problems.

Q.22. Discuss the architecture of back propagation network. What are the activation functions used in back propagation network ? (RGPV, Dec. 2017)

Ans. A back propagation neural network is a multilayer, feedforward neural network. It is made up of an input layer, a hidden layer and an output layer. The neurons in the hidden and output layers contain biases. These biases are the connections from units whose activation is always 1. Also, the bias terms act as weights. The architecture of a back propagation network is shown in fig. 2.13. The figure shows only the direction of information flow for the feedforward phase. The signals are fed in the opposite direction during the back propagation phase. The inputs are sent to the BPN. The output achieved from the net could be either binary (0, 1) or bipolar (-1, +1).

Fig. 2.13 Back Propagation Network Architecture

The activation functions used in the back propagation network are the binary sigmoidal and bipolar sigmoidal activation functions. These functions are used due to the following characteristics - (i) differentiability, (ii) continuity, (iii) non-decreasing monotony.

The binary sigmoidal function range is between 0 and 1, and the bipolar sigmoidal function range is between -1 and +1.

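
Both activation functions and their derivatives are short to write down. The sketch below assumes a steepness parameter of 1, which the text does not specify; the function names are illustrative.

```python
import math

def binary_sigmoid(x):
    """Logistic function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """Bipolar sigmoid with range (-1, +1); equal to tanh(x / 2)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def binary_sigmoid_derivative(x):
    f = binary_sigmoid(x)
    return f * (1.0 - f)

def bipolar_sigmoid_derivative(x):
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 + f) * (1.0 - f)
```

Both derivatives can be expressed through the function value itself, which is convenient in back propagation because the forward-pass outputs can be reused in the backward pass.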
Q.23. How does back propagation work ?

Ans. The network learns a predefined set of input-output example pairs by using a two-phase propagate-adapt cycle. After an input pattern has been applied as a stimulus to the first layer of network units, it is propagated through each upper layer until an output is generated. This output pattern is then compared to the desired output, and an error signal is computed for each output unit. The error signals are then transmitted backward from the output layer to each node in the intermediate layer that contributes directly to the output. However, each unit in the intermediate layer receives only a portion of the total error signal, based roughly on the relative contribution the unit made to the original output. This process repeats, layer by layer, until each node in the network has received an error signal that describes its relative contribution to the total error. Based on the error signal received, connection weights are then updated by each unit to cause the network to converge toward a state that allows all the training patterns to be encoded.

The significance of this process is that, as the network trains, the nodes in the intermediate layers organize themselves such that different nodes learn to recognize different features of the total input space. After training, when presented with an arbitrary input pattern that is noisy or incomplete, the units in the hidden layers of the network will respond with an active output if the new input contains a pattern that resembles the feature the individual units learned to recognize during training. Conversely, hidden-layer units have a tendency to inhibit their outputs if the input pattern does not contain the feature that they were trained to recognize.

As the signals propagate through the different layers in the network, the activity pattern present at each upper layer can be thought of as a pattern with features that can be recognized by units in the subsequent layer. The output pattern generated can be thought of as a feature map that provides an indication of the presence or absence of many different feature combinations at the input. The total effect of this behaviour is that the BPN provides an effective means of allowing a computer system to examine data patterns that may be incomplete or noisy, and to recognize subtle patterns from the partial input.

Fig. 2.14 (flowchart; legible steps: input of the network architecture and training patterns; form a network and initialize its free parameters, i.e. synaptic weights, randomly; forward pass; weight update)

Q.24. Explain the working of back propagation neural network with neat architecture and flowchart. (RGPV, June ...)

Ans. Refer to Q.23 and Q.22.

Q.25. What learning rate should be used for back propagation ? (RGPV, Dec. 2008)

Ans. The choice of learning rate is a tricky task in back propagation. The range of the learning coefficient that will produce rapid training depends on the number and types of input patterns, and there is no single learning coefficient suitable for different training cases. An empirical formula to select the learning coefficient, suggested by Eaton and Olivier in 1992, is given as

$$\eta = \frac{1.5}{\sqrt{N_1^2 + N_2^2 + \dots + N_m^2}}$$

where N_i is the number of patterns of type i and m is the number of different pattern types.

Selection of a value for the learning rate parameter has a significant effect on the network performance. The learning rate cannot be negative because this would cause the change of the weight vector to move away from the ideal weight vector position. If the learning coefficient is zero, no learning takes place and hence, the learning coefficient must be positive. If the learning coefficient is equal to 2 then the network is unstable, and if the learning coefficient is greater than 1, the weight vector will overshoot from its ideal position and oscillate. Large values for the learning rate are used when the input data patterns are close to the ideal, otherwise small values are used. If the nature of the input data patterns is not known, it is better to use moderate values. Usually, η must be a small number - on the order of 0.05 to 0.25 - to ensure that the network will settle to a solution. A small value of η means that the network will have to make a large number of iterations, but that is the price to be paid.
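
The effect of the learning coefficient described above can be reproduced on a toy problem. The sketch below makes an assumption that is not in the text: the error surface is the one-dimensional quadratic E(w) = 0.5 (w - w_ideal)^2, for which each update multiplies the remaining error by (1 - η). The helper for the empirical formula follows the expression given above; both function names are illustrative.

```python
import math

def eaton_olivier_learning_rate(pattern_counts):
    """eta = 1.5 / sqrt(N_1^2 + ... + N_m^2) for m pattern types."""
    return 1.5 / math.sqrt(sum(n * n for n in pattern_counts))

def error_after_steps(eta, steps=10, w=0.0, w_ideal=1.0):
    """Gradient descent on E(w) = 0.5 * (w - w_ideal)^2; returns the signed errors."""
    errors = []
    for _ in range(steps):
        w -= eta * (w - w_ideal)  # dE/dw = w - w_ideal
        errors.append(w - w_ideal)
    return errors
```

On this surface η = 0 leaves the weight unchanged, values between 0 and 1 approach the ideal weight from one side, values between 1 and 2 overshoot and oscillate with shrinking amplitude, η = 2 oscillates forever, and a negative η moves the weight away from the ideal position.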


Q.26. What do you understand by deep neural network ?

Ans. Neural networks (NN) are powerful statistical learning models that can solve complex classification and regression problems. A neural network is composed of interconnected layers of computation units, mimicking the structure of the brain's neurons and their connectivity. Each neuron applies an activation function (i.e., a non-linear transformation) to a weighted sum of its input values plus a bias. The predicted output of a neural network is calculated by computing the outputs of each neuron through the network layers in a feed-forward manner. Deep neural networks (DNNs), which are the backbone of deep learning, use a cascade of multiple hidden layers to increase exponentially the learning capacity of neural networks. Deep neural networks are built using differentiable model fitting, which is an iterative process during which the model trains itself on input data through gradient based optimization routines, making small adjustments iteratively, with the aim of refining the model until it predicts mostly right outputs. The principal components of a neural network are as follows -

(i) Parameters - These represent the trained weights and biases used by the neurons to make their internal calculations.

(ii) Activations - These represent non-linear functions that add non-linearity to the neuron output, aimed at indicating, fundamentally, whether a neuron should be activated or not.

(iii) Loss Function - It consists of a math function, which estimates the distance between predicted and actual outcomes. If the DNN's predictions are perfect, the loss is zero; otherwise, the loss is greater than zero.

(iv) Regularization - It consists of techniques that penalize the model's complexity to prevent overfitting, such as L1-L2 regularization or dropout.

(v) Optimizer - It adjusts the parameters of the model iteratively (reducing the objective function), in order to - (a) build the best-fitted model, i.e., lowest loss; and (b) keep the model as simple as possible, i.e., strong regularization. The most used optimizers are based on gradient descent algorithms.

(vi) Hyperparameters - These are model parameters that are constant during the training phase and which can be fixed before running the fitting process, such as the number of layers or the learning rate.

When training a deep neural network, machine learning developers set hyperparameters and choose loss functions, regularization techniques, and gradient based optimizers. After training, the best-fitted model is evaluated on a testing dataset (which should be different from the training dataset), using measures such as accuracy. The occurrence of errors in the training program of a DNN often translates into poor model performance. Therefore, it is important to ensure that DNN program implementations are bug-free. Given the large size of the testing space of a DNN, systematic debugging and testing techniques are needed to assist developers in error detection and correction activities.

Q.27. Write down the strength and weakness of weight initialization.

Ans. The weight initialization process has the following strengths -

(i) We can improve the classification performance of an FNN using an optimal initialization of the connection weights.

(ii) The convergence of gradient-based methods on well initialized networks is faster than on randomly initialized ones.

The weight initialization process has the following weaknesses -

(i) There is a lack of statistical models in the literature that indicate whether a given weight initialization method could give good or bad results for a given training sample and network architecture.

(ii) Many of the existing methods are computationally expensive with respect to the number of connection weights, being unfeasible for large networks.

Q.28. Explain the term training, test and validation sets.

Ans. Choosing the right dataset for a given classification problem is of great importance for the generalization performance. Ultimately, the ability to classify unseen data correctly is highly dependent on the underlying dataset itself. For an image classification problem, the input X consists of a variety of grayscale or color images and their corresponding ground truth labels y. An image X_i can be viewed as a tensor of numbers with c color channels and corresponding height h and width w. For a color image, its three channels correspond to the red, green and blue (RGB) intensities at each pixel. Conversely, for a grayscale image, the single channel corresponds to the grayscale intensity.

For any supervised machine learning task, it is common to split the entire dataset into three subsets called the training set, the test set, and the validation set. They each serve a different purpose and are motivated as follows.

Fig. 2.15 Schematic Overview of Training, Validation and Test Set (training set: parameters; validation set: hyperparameters; test set: generalization)

Training Set - It is typically the largest of the three and is used to find the model parameters that best explain the underlying predictive relationship between the data and its labels. Most approaches that fit parameters based on empirical relationships with the training set alone tend to overfit the data, meaning that they can identify apparent relationships in the training data but fail to do so in general. This motivates the use of a test set.

Test Set - This is a set of data that is not used during training, but follows the same probability distribution and predictive relationship. If a model fits well on the training set and also fits the test set well, i.e. it predicts the correct label for a large number of unseen input data, minimal overfitting has taken place. It is important to note that the test set is usually only employed once, as soon as the model's parameters and hyperparameters are fully specified, in order to assess the model's generalization performance. However, to approximate a model's predictive performance during training, a validation set is used.

Validation Set - It is created by splitting the training set in two parts, the smaller of which is commonly used for hyperparameter optimization and to avoid overfitting through early stopping. To choose favourable hyperparameters and maximize the model's generality, the validation set is used as an additional dataset for validation, i.e. it is used to fit the hyperparameters.

Early stopping is a universally applicable attempt to find the optimal time to stop the training process and to decide when a model is fully specified with respect to its parameters and hyperparameters. A common technique is to monitor the so-called validation accuracy and to stop training once it stops improving. The validation accuracy can be seen as an approximation of the final classification performance that imposes minimal assumptions on the data's predictive relationship, both theoretically and in practice.
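
The split into three sets and the early stopping rule can be sketched as follows. The split fractions, the patience of two epochs, and the function names are illustrative assumptions rather than values from the text.

```python
import random

def train_val_test_split(data, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut the data into disjoint training, validation and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def early_stopping_epoch(val_accuracies, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation accuracies."""
    best_epoch, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc  # a new best model would be saved here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch           # no improvement for `patience` epochs
    return len(val_accuracies) - 1, best_epoch
```

In a real training loop the validation accuracy would be measured after each epoch on the validation set only; the test set stays untouched until the model and its hyperparameters are fixed.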


Q.29. Describe the unstable gradient problem.

Ans. Due to poor weight initialization and bad choices of hyperparameters, DNNs can be exposed to the unstable gradient problem, which could manifest in the form of vanishing or exploding gradients as described below -

(i) Vanishing Gradient - In this case, the gradient tends to have smaller values when it is back propagated through the hidden layers of the DNN. The values may even be transformed to undefined values (NaN) caused by underflow rounding precision during discrete executions on hardware. The problem of vanishing gradient can lead to the stagnation of the training process and eventually cause a numerical instability. As an illustration, we take the example of a DNN configured to have a sigmoid activation function and randomly initialized weights using a Gaussian distribution with a zero mean and a unit standard deviation. The sigmoid function has a maximum derivative value of 0.25 (i.e., the derivative of the sigmoid when the input is equal to zero), and the absolute value of the product of a weight and this derivative is less than 0.25 when the weights lie in a limited range such as [-1, 1]. The gradient of a given hidden layer in a NN would be computed by the sum of products of all the gradients belonging to the deeper layers and the weights assigned to each of the links between them. As a result, the more the gradient is back propagated over the hidden layers, the more there are products of several terms that are less than or equal to 0.25. Hence, it is apparent that earlier hidden layers (i.e., closer to the input layer) would have a very small gradient and would be almost stagnant, with few weight changes during the training.

(ii) Exploding Gradient - The exploding gradient phenomenon can be encountered when, inversely, the gradient with respect to the earlier layers diverges and its values become huge. As a consequence, this could result in the appearance of -/+ ∞ values. Returning to the previous DNN example, the same NN can suffer from exploding gradients in case the parameters are large in a way that their products with the derivative of the sigmoid keep them on the higher side until the gradient value explodes and eventually becomes numerically unstable.

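
The two regimes can be illustrated with a deliberately simplified model. The sketch assumes a chain of layers with one sigmoid neuron each, the same weight on every link, and every pre-activation at zero, so that the back propagated gradient is scaled by the same factor at every layer; none of these simplifications come from the text.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backpropagated_factor(weight, depth, z=0.0):
    """Magnitude of the product of (weight * sigmoid'(z)) over `depth` layers."""
    return abs(weight * sigmoid_derivative(z)) ** depth
```

With a weight of 1 the factor after 10 layers is 0.25 to the power 10, which is below one millionth, whereas a weight of 8 gives 2 to the power 10, i.e. 1024.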
Q.30. Write short note on autoencoders.

Ans. Autoencoders are simple learning circuits which aim to transform inputs into outputs with the least possible amount of distortion. While conceptually simple, they play an important role in machine learning. Autoencoders were first introduced in the 1980s by Hinton and the PDP group to address the problem of "back propagation without a teacher", by using the input data as the teacher. Together with Hebbian learning rules, autoencoders provide one of the fundamental paradigms for unsupervised learning and for beginning to address the question of how synaptic changes induced by local biochemical events can be coordinated in a self-organized manner to produce global learning and intelligent behaviour.

More recently, autoencoders have taken center stage again in the "deep architecture" approach, where autoencoders, particularly in the form of restricted Boltzmann machines (RBMs), are stacked and trained bottom up in unsupervised fashion, followed by a supervised learning phase to train the top layer and fine-tune the entire architecture. The bottom up phase is agnostic with respect to the final task and thus can obviously be used in transfer learning approaches. These deep architectures have been shown to lead to state-of-the-art results on a number of challenging classification and regression problems.


Q.31. Describe a general autoencoder framework.

Ans. To derive a fairly general framework, an n/p/n autoencoder as shown in fig. 2.16 is defined by a t-uple n, p, m, F, G, 𝒜, ℬ, X, Δ where -

(i) F and G are sets.

(ii) n and p are positive integers. Here we consider primarily the case where 0 < p < n.

(iii) 𝒜 is a class of functions from G^p to F^n.

(iv) ℬ is a class of functions from F^n to G^p.

(v) X = {x_1, ..., x_m} is a set of m (training) vectors in F^n. When external targets are present, we let Y = {y_1, ..., y_m} denote the corresponding set of target vectors in F^n.

(vi) Δ is a dissimilarity or distortion function (e.g., L_p norm, Hamming distance) defined over F^n.

Fig. 2.16 An n/p/n Autoencoder Architecture

For any A ∈ 𝒜 and B ∈ ℬ, the autoencoder transforms an input vector x ∈ F^n into an output vector A ∘ B(x) ∈ F^n (see fig. 2.16). The corresponding autoencoder problem is to find A ∈ 𝒜 and B ∈ ℬ that minimize the overall distortion function -

$$\min E(A, B) = \min_{A,B} \sum_{t=1}^{m} E(x_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t),\, x_t\big)$$

In the non auto-associative case, when external targets y_t are provided, the minimization problem becomes -

$$\min E(A, B) = \min_{A,B} \sum_{t=1}^{m} E(x_t, y_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t),\, y_t\big)$$

The case p < n corresponds to the regime where the autoencoder tries to implement some form of compression or feature extraction. The case p ≥ n is not treated here.

From this general framework, different kinds of autoencoders can be derived depending, for instance, on the choice of sets F and G, transformation classes 𝒜 and ℬ, distortion function Δ, as well as the presence of additional constraints, such as regularization. Neural network autoencoders were first introduced by the PDP group as a special case of this definition, with all vector components in F = G = [0, 1] and with A and B corresponding to matrix multiplications followed by non-linear sigmoidal transformations with a squared L2 error function. (For regression problems, the non-linear sigmoidal transformation is typically used only in the hidden layer.) As an approximation to this case, one can study the linear case with F = G = R. More generally, linear autoencoders correspond to the case where F and G are fields and 𝒜 and ℬ are the classes of linear transformations, hence A and B are matrices of size n × p and p × n respectively.
NX P reg 
0.32. What do you understand by batch normalization 2 res 3 

Ans. Batch normalization is a recently popularized method for Sie 
deep network training by making data standardization an integral Mie, 
network architecture. Batch normalization can be seen as yet aa 
that can be inserted into the model architecture, just like the fully Panke 
convolutional layer. It provides a definition for feed-forwarding the ties 
computing the gradients with respect to the parameters and its own input, 
backward pass. In practice, batch normalization layers are inserted a, 
convolutional or fully connected layer, but before the outputs are fed into. 
activation function. For convolutional layers, the differe: 


nt, elements oft 
same feature map — i.e. the activations — at different locations are normalis 


in the same way in order to obey the convolutional property. Thus, all activatie 
in a mini-batch are normalized over all locations, rather than per activation 


Clg, 


The authors of batch normalization claim that the internal covariate sift 
is the major reason why deep architectures have been notoriously slow to tri 
This stems from the fact that deep networks do not only have to leaman 


representation at each layer, but also have to account for the change in thet 
distribution. 


The covariate shift in general is a known problem in the machine leant : 
community and frequently occurs in real-world problems. A common one 
shift problem is the difference in the distribution of the training and m 
which can lead to suboptimal generalization performance as Sane 
2.17 (a). However, especially the whitening operation is computato 
expensive and thus impractical in 'an-online setting, especially if the o 
shift occurs throughout different layers. 


+: 
. opytat 
The internal covariate shift is the phenomenon where the eee 
network activations change across layers due to the change vers 
Parameters during training as shown in fig. 2.17 (b). Ideally, each tu 
be transformed into a space where they have the same Lars tation” 
functional relationship stays the same. In order to avoid costly C24 cand 
covariance matrices to decorrelate and whiten the data at every nie 
We normalize the distribution of each input feature in each layer 


mini-batch to have zero mean and a standard deviation of one- 
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~ 


x Layer 2 


Test Set 


Training 


1 
i 
H 
| 
Set 
f 
I 
f 
I 


zate Shift (b) Internal Covariate Shift 
ie 17 Govariate Shift vs. Internal Covariate Shift 
Fig. 2. — During the forward pass, we compute the mini-batch 
Forward Path these mini-batch statistics, we normalize the data by 
an and dividing by the standard deviation. Finally, we scale 
qbracting the a. the learned scale and shift parameters. The batch 
"i Ca h W pass fgn can be described mathematically as 
normati 


(a) 


men and V: 


x) _ yD +p 


ri Hn is the batch mean and oR the batch variance, respectively. ve 
i Scale and shift parameters are denoted by y and p, respectively. For 


s svati, mit 
n We describe the batch normalization procedure per activation andio 
oesponding indices, 


Since normalization is a differentiable transform, we can propagate errors into these learned parameters and are thus able to restore the representational power of the network by learning the identity transform. Conversely, by learning scale and shift parameters that are identical to the corresponding batch statistics, the batch normalization transform would have no effect on the network, if that was the optimal operation to perform. At test time, the batch mean and variance are replaced by the respective population statistics, since the output for an input should not depend on the other samples from a mini-batch. Another method is to keep running averages of the batch statistics during training and use them at test time.
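The forward pass described above can be sketched for a single activation as follows. This is a minimal illustration in plain Python, not a reference implementation: the function name `batch_norm_forward`, the default values of `gamma` and `beta`, and the stability constant `eps` are assumptions made for this example.

```python
import math

def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift it.

    batch : list of floats, the values of one activation across the mini-batch
    gamma : learned scale parameter (assumed value for illustration)
    beta  : learned shift parameter (assumed value for illustration)
    eps   : small constant that avoids division by zero (assumed value)
    """
    m = len(batch)
    mean = sum(batch) / m                              # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m      # mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    out = [gamma * xh + beta for xh in x_hat]          # scale and shift
    return out, mean, var
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance; other values let the network undo the normalization if that is beneficial.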
The bias is a measure of how much the network output, averaged over all possible datasets, differs from the desired function. The variance is a measure of how much the network output varies between datasets. Early in training, the bias is large because the network output is far from the desired function. The variance is very small since the data has had little influence on the model parameters yet. Later in training, however, the bias is small because the network has learned the underlying representation and has the ability to approximate the desired function. If trained too long, and assuming the model has enough representational power, the network will also have learned the noise specific to that dataset, which is referred to as overfitting. In case of overfitting we have poor generalization, and the variance will be large because the noise varies between datasets. It can be shown that the minimum total error will occur when the sum of the squared bias and the variance is minimal.

In a neural network setting, the loss function is not convex and may thus contain many different local minima. During training, we want to reach a local minimum that explains the data in the simplest possible way according to Occam's razor, thus having a high chance of generalization.

Q.36. Write short note on early stopping.

Ans. Early stopping is the easiest and most effective method for regularizing the solution and to aid generalization performance. Simply put, the validation accuracy on the held-out validation set is continuously monitored and the training is stopped once the accuracy stops improving. In practice, one can save the best performing model parameters in addition to the current parameters and fall back on the saved ones once further improvements seem unlikely.
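The monitoring rule above can be sketched as follows. The function name `early_stopping` and the `patience` parameter (how many epochs without improvement are tolerated) are assumptions for this illustration; the list of validation accuracies stands in for a real training loop.

```python
def early_stopping(val_accuracies, patience=2):
    """Return (best_epoch, best_accuracy) for a sequence of validation accuracies.

    Training is considered stopped once `patience` consecutive epochs
    bring no improvement over the best accuracy seen so far.
    """
    best_epoch, best_acc = -1, float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # In a real training loop the model parameters would be saved here.
            best_epoch, best_acc = epoch, acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop and fall back on the saved parameters
    return best_epoch, best_acc
```

Note that a later epoch with a higher accuracy is never reached once the rule fires, which is the price paid for stopping early.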


Q.37. Explain in detail the term weight regularization.

Ans. A common practice is to introduce an additional term to the loss function such that the total loss is a combination of data loss and regularization loss

$$L(W) = \underbrace{\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_i}_{\text{Data loss}} + \underbrace{\lambda R(W)}_{\text{Regularization loss}}$$

where $\mathcal{L}$ can be any loss function, $R(W)$ is the regularization penalty and $\lambda$ is a hyperparameter that controls the regularization strength.

The intuition behind weight regularization is therefore to prefer smaller weights, and thus the local minima which have a simpler solution. In a neural network setting, this technique is also referred to as weight decay. In practice, regularization is only applied to the weights of the network, not its biases. This stems from the fact that the biases do not interact with the data in a multiplicative fashion, and therefore do not have much influence on the loss.
The regularization penalty can be defined in a number of ways, the most popular of which are as follows —

(i) Lasso Regularization — Lasso or $L_1$ regularization encourages zero weights and therefore sparsity. It is often encountered and computes the $L_1$ norm of the weights, i.e. the sum of absolute values as in

$$R_{lasso}(W) = \|W\|_1 = \sum_{i=1}^{n}\sum_{j=1}^{m} |w_{ij}|$$

(ii) Ridge Regularization — Ridge or $L_2$ regularization encourages small weights. It is the most popular choice of weight regularization in practice and computes the squared $L_2$ norm of the weights, i.e. the sum of squares as in

$$R_{ridge}(W) = \|W\|_2^2 = \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^2$$

(iii) Elastic Net Regularization — Elastic net regularization combines $L_1$ and $L_2$ regularization. Elastic net tends to have a grouping effect, where correlated input features are assigned equal weights. It is commonly used in practice and is implemented in many machine learning libraries. Elastic net regularization can be formalized as —

$$R_{elastic}(W) = \alpha \|W\|_1 + (1 - \alpha) \|W\|_2^2$$

where the amount of $L_1$ or $L_2$ regularization can be adjusted via the hyperparameter $\alpha \in [0, 1]$.

A two dimensional geometric representation of the weight regularization methods is depicted in fig. 2.19.

Fig. 2.19 Geometrical Interpretation of Weight Regularization Methods
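The three penalties and the combined loss can be sketched in a few lines. The function names, the nested-list representation of the weight matrix and the example values of `lam` and `alpha` are assumptions for illustration only.

```python
def l1_penalty(W):
    """Lasso penalty: sum of absolute values of all weights."""
    return sum(abs(w) for row in W for w in row)

def l2_penalty(W):
    """Ridge penalty: sum of squares of all weights."""
    return sum(w * w for row in W for w in row)

def elastic_net_penalty(W, alpha):
    """Elastic net: alpha weighs the L1 part, (1 - alpha) the L2 part."""
    return alpha * l1_penalty(W) + (1.0 - alpha) * l2_penalty(W)

def total_loss(data_loss, W, lam, alpha=0.0):
    """Data loss plus regularization loss; biases are deliberately not included."""
    return data_loss + lam * elastic_net_penalty(W, alpha)
```

Setting `alpha = 1` recovers pure lasso and `alpha = 0` pure ridge regularization, so one function covers all three cases.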
Q.38. Discuss the significance of momentum in back propagation.

Ans. Momentum is a method for improving the training time of the back propagation algorithm while enhancing the stability of the process. The method adds a term to the weight adjustment that is proportional to the amount of the previous weight change. Once an adjustment is made, it is "remembered" and serves to modify all subsequent weight adjustments. The adjustment equations become

$$\Delta w_{pq,k}(n+1) = \eta\,(\delta_{q,k}\,OUT_{p,j}) + \alpha\,[\Delta w_{pq,k}(n)]$$

$$w_{pq,k}(n+1) = w_{pq,k}(n) + \Delta w_{pq,k}(n+1)$$

where $\eta$ is the training rate coefficient and $\alpha$, the momentum coefficient, is commonly set to around 0.9.

Using the momentum method, the network tends to follow the bottom of narrow gullies in the error surface (if they exist) rather than crossing rapidly from side to side. This method seems to work well on some problems, but it has little or negative effect on others.

Sejnowski and Rosenberg describe a similar method based on exponential smoothing that may prove superior in some applications.

$$\Delta w_{pq,k}(n+1) = \alpha\,\Delta w_{pq,k}(n) + (1 - \alpha)\,\delta_{q,k}\,OUT_{p,j}$$

Then the weight change is computed —

$$w_{pq,k}(n+1) = w_{pq,k}(n) + \eta\,\Delta w_{pq,k}(n+1)$$

where $\alpha$ is a smoothing coefficient in the range of 0.0 to 1.0. If $\alpha$ is 0.0, then smoothing is minimum and the entire weight adjustment comes from the newly calculated change. If $\alpha$ is 1.0, the new adjustment is ignored and the previous one is repeated.
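Both update rules can be compared on a single weight. In this sketch `grad_term` stands for the product of the error signal and the neuron output that appears in the equations; the function names and default coefficients are assumptions for illustration.

```python
def momentum_step(w, prev_delta, grad_term, eta=0.1, alpha=0.9):
    """Momentum: new change = eta * grad_term + alpha * previous change."""
    delta = eta * grad_term + alpha * prev_delta
    return w + delta, delta

def smoothing_step(w, prev_delta, grad_term, eta=0.1, alpha=0.9):
    """Exponential smoothing: blend previous change and new term, then scale by eta."""
    delta = alpha * prev_delta + (1.0 - alpha) * grad_term
    return w + eta * delta, delta
```

With `alpha = 0.0` the smoothing rule uses only the newly calculated change, and with `alpha = 1.0` it repeats the previous change, which matches the two limiting cases described in the text.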


Q.39. Define the momentum vs. Nesterov momentum update rule.

Ans. Momentum update is an optimization approach that improves the convergence rates of stochastic gradient descent in deep neural networks. The momentum update rule can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as the height of a hilly terrain, and the optimization process can be seen as equivalent to simulating a ball, i.e. the parameter values, rolling down the landscape. The difference to an SGD update is that the gradient directly influences the velocity, which in turn has an effect on the position. The momentum update rule can be expressed as

$$v \leftarrow \mu v - \eta\,\nabla_\theta L(\theta)$$

$$\theta \leftarrow \theta + v$$

where $v$ is the velocity that accumulates the gradients over time, $\eta$ is the learning rate, and the hyperparameter $\mu$ is referred to as momentum, consistent with the physical interpretation.

Nesterov Momentum — Nesterov momentum is another popular update rule inspired by Nesterov's accelerated gradient and proposes a slightly different version of the regular momentum update. It enjoys stronger theoretical convergence guarantees for convex functions and performs slightly better than regular momentum in practice. The core idea is that when the current parameter vector is at position $\theta$, we know that the momentum term alone is about to nudge the parameter vector by $\mu v$. Therefore, we can treat the future approximate position $\theta + \mu v$ as a look ahead as in

$$v \leftarrow \mu v - \eta\,\nabla_\theta L(\theta + \mu v)$$

$$\theta \leftarrow \theta + v$$

where the variables correspond to the ones of the regular momentum update, except that the gradient is evaluated at $\theta + \mu v$ instead of the old position $\theta$.

A visual comparison of regular momentum and Nesterov momentum can be obtained in fig. 2.20.

(a) Momentum Update (b) Nesterov Momentum Update

Fig. 2.20 Momentum vs. Nesterov Momentum Update Rule
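The difference between the two rules is only where the gradient is evaluated. The sketch below uses a one-dimensional toy loss, L(theta) = theta squared, chosen purely for illustration; the function names, learning rate and momentum value are assumptions.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2 (illustrative assumption)."""
    return 2.0 * theta

def momentum_update(theta, v, lr=0.1, mu=0.9):
    """Regular momentum: gradient taken at the current position."""
    v = mu * v - lr * grad(theta)
    return theta + v, v

def nesterov_update(theta, v, lr=0.1, mu=0.9):
    """Nesterov momentum: gradient taken at the look-ahead position theta + mu * v."""
    v = mu * v - lr * grad(theta + mu * v)
    return theta + v, v
```

Starting from the same position with zero velocity, both rules take an identical first step; they diverge from the second step onward, once the velocity is non-zero and the look-ahead position differs from the current one.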
Q.40. What is hyperparameter ? Explain hyperparameter selection.

Ans. The parameters of the convolution layer which need to be set by the user (based on cross-validation or experience) prior to the filter learning (such as stride and padding) are called hyperparameters. These hyperparameters can be interpreted as the design choices of our network architecture based on a given application.

Hyperparameters are a very important concept in machine learning, especially in the context of neural networks. Intuitively, hyperparameters can be seen as settings that are chosen independently of the model parameters, and are typically set before the training process begins. For neural networks trained with backpropagation and stochastic gradient descent, the most important global hyperparameters include the learning rate and the batch size.
Hyperparameter selection is usually performed with one of the following techniques — manual search, grid search, and random search.

Manual search essentially refers to picking one hyperparameter intuitively and testing its performance on the validation set.

Grid search refers to a technique where multiple hyperparameters, such as the learning rate and the batch size, are picked within fixed intervals and step sizes in order to maximize performance on the validation set.

Random search refers to hyperparameter selection with fixed intervals but random step sizes and has several favourable properties compared to manual or grid search.
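Grid search and random search can be contrasted with a small sketch. The function `validation_score` is a hypothetical stand-in for "train the network and measure performance on the validation set"; its formula, the candidate values and the ranges are invented for illustration only.

```python
import itertools
import random

def validation_score(lr, batch_size):
    """Hypothetical validation score, highest at lr = 0.01 and batch_size = 64."""
    return -((lr - 0.01) ** 2) * 1e4 - ((batch_size - 64) / 64.0) ** 2

def grid_search(lrs, batch_sizes):
    """Try every combination on a fixed grid and keep the best one."""
    return max(itertools.product(lrs, batch_sizes),
               key=lambda p: validation_score(*p))

def random_search(lr_range, bs_range, trials, seed=0):
    """Sample candidates at random inside fixed intervals and keep the best one."""
    rng = random.Random(seed)
    candidates = [(rng.uniform(*lr_range), rng.randint(*bs_range))
                  for _ in range(trials)]
    return max(candidates, key=lambda p: validation_score(*p))
```

Grid search evaluates a number of candidates that grows multiplicatively with each added hyperparameter, whereas random search keeps the number of trials fixed regardless of how many hyperparameters are tuned.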
CONVOLUTIONAL NEURAL NETWORK, FLATTENING, SUBSAMPLING, PADDING, STRIDE, CONVOLUTION LAYER, POOLING LAYER, LOSS LAYER, DENSE LAYER, 1x1 CONVOLUTION


Q.1. Write short note on convolutional neural network.

Ans. A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.


Design — A convolutional neural network consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, a ReLU layer (i.e. activation function), pooling layers, fully connected layers and normalization layers. Description of the process as a convolution in neural networks is by convention. Mathematically it is a cross-correlation rather than a convolution (although cross-correlation is a related operation). This only has significance for the indices in the matrix, and thus which weights are placed at which index.

Convolutional — Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli. Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) image of size 100 × 100 has 10000 weights for each neuron in the second layer. The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For instance, regardless of image size, tiling regions of size 5 × 5, each with the same shared weights, requires only 25 learnable parameters. In this way, it resolves the vanishing or exploding gradients problem in training traditional multi-layer neural networks with many layers by using back propagation.

Pooling — Convolutional networks may include local or global pooling layers. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Pooling may compute a max or an average. Max pooling uses the maximum value from each of a cluster of neurons at the prior layer. Average pooling uses the average value from each of a cluster of neurons at the prior layer.

Fully Connected — Fully connected layers connect every neuron in one layer to every neuron in another layer. It is in principle the same as the traditional multilayer perceptron neural network (MLP). The flattened matrix goes through a fully connected layer to classify the images.

Receptive Field — In neural networks, each neuron receives input from some number of locations in the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer. Typically the subarea is of a square shape (e.g., size 5 by 5). The input area of a neuron is called its receptive field. So, in a fully connected layer, the receptive field is the entire previous layer. In a convolutional layer, the receptive area is smaller than the entire previous layer.

Weights — Each neuron in a neural network computes an output value by applying some function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is specified by a vector of weights and a bias. Learning in a neural network progresses by making incremental adjustments to the biases and weights. The vector of weights and the bias are called a filter and represent some feature of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons share the same filter. This reduces memory footprint because a single bias and a single vector of weights is used across all receptive fields sharing that filter, rather than each receptive field having its own bias and vector of weights.

Q.2. Explain: (i) Filter size (ii) Depth (iii) Padding (iv) Stride.

Ans. The hyperparameters of a convolutional layer are its spatial filter size, depth, padding and stride. They have to be chosen carefully in order to generate a desired output.

(i) Filter Size — The filter size corresponds to the spatial extent (width and height) of the filters that are convolved with the input image at different locations. Usually, the filters are quadratic, i.e. have the same width and height, and all filters of a particular convolutional layer have the same size. The filter size is heavily dependent on the spatial extent of the objects in the image, so we generally see larger filters for larger inputs. Recently, the trend is to replace larger filters with smaller ones, which is motivated by a reduction of learnable parameters and substantial speed improvements.

(ii) Depth — The depth of the output volume controls the number of learnable filters that connect to the same region of the input volume. All of these filters will learn to activate for different features of the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color, for instance.

(iii) Padding — The padding parameter controls the spatial size of the output volumes by extending the input volume with zeros at its borders. In particular, sometimes it is desirable to exactly preserve the spatial size of the input volume. In order to perform padding that is more true to the content of the original image, two commonly used methods are reflection and repetition padding. Reflection padding mirrors the border of the image outwards, whereas repetition padding assigns the padded pixels the same values as the outer edges in the original image.

(iv) Stride — The stride parameter controls how depth columns around the spatial dimensions are allocated. When the stride is one, a new depth column is allocated in positions only one spatial unit apart. This leads to heavily overlapping receptive fields between the columns, and also to large output volumes. If higher strides are used, then the receptive fields will overlap less and the resulting output volume will have smaller dimensions spatially.



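Q.3. Define the term sub-sampling layer.

Ans. The subsampling layer performs the down sampling operation on the input maps. This is commonly known as the pooling layer. In this layer, the number of input and output feature maps does not change. For example, if there are N input maps, then there will be exactly N output maps. Due to the down sampling operation, the size of each dimension of the output maps will be reduced, depending on the size of the down sampling mask. For example, if a 2 × 2 down sampling kernel is used, then each output dimension will be half of the corresponding input dimension. This operation can be formulated as —

$$x_j^{l} = \mathrm{down}(x_j^{l-1})$$

where down(·) represents a sub-sampling function. Two types of operations are mostly performed in this layer, namely average pooling and max pooling. In the case of average pooling, the function sums up over N × N patches of the feature maps from the previous layer and selects the average value. In the case of max pooling, the highest value is selected from the N × N patches of the feature maps. Therefore, the output map dimensions are reduced by N times. In some special cases, each output map is multiplied by a scalar. Some alternative sub-sampling layers have also been proposed, such as sub-sampling with convolution.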

Q.4. Explain in detail about the convolution layer.

Ans. The convolution layer, or convolutional networks in general, is loosely based on cells in the mammalian brain's visual cortex.

(a) 3D (b) 2D

Fig. 3.1 Input, Filter and Output of a Convolution in 2D and 3D


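During the forward pass, the learnable filters are convolved with the input volume. The intuition behind the discrete convolution operation is to slide a filter over the input volume at different overlapping spatial locations as in

$$x^{(l)}_{i,j} = \sum_{u}\sum_{v} w_{u,v}\, x^{(l-1)}_{i-u,\,j-v} + b$$

where $x^{(l)}$ represents the current layer's output for a given filter, $x^{(l-1)}$ is the output of the previous layer, $w$ and $b$ are the filter weights and bias, and the spatial extent of the filter in the horizontal and vertical direction is given by $u$ and $v$.

During the backward pass, we calculate the partial derivatives of the loss $L$ with respect to the weights, the bias and the input of the respective layer as in

$$\frac{\partial L}{\partial w_{u,v}} = \sum_{i}\sum_{j} \frac{\partial L}{\partial x^{(l)}_{i,j}}\, x^{(l-1)}_{i-u,\,j-v}, \qquad \frac{\partial L}{\partial b} = \sum_{i}\sum_{j} \frac{\partial L}{\partial x^{(l)}_{i,j}}$$

$$\frac{\partial L}{\partial x^{(l-1)}_{i,j}} = \sum_{u}\sum_{v} w'_{u,v}\, \frac{\partial L}{\partial x^{(l)}_{i-u,\,j-v}}$$

where $w'$ denotes a spatially flipped filter in order to compute the cross-correlation rather than a convolution.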
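The relationship between convolution and cross-correlation, which differ only by a spatial flip of the filter, can be checked with a small sketch. The function names, the "valid" (no padding) output region and the example arrays are assumptions for illustration.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it (valid region only)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def flip(kernel):
    """Flip the kernel in both spatial directions."""
    return [row[::-1] for row in kernel[::-1]]

def convolve2d(image, kernel):
    """A true convolution is a cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip(kernel))
```

For a kernel that is symmetric under the flip the two operations coincide, which is one reason the distinction is often treated as a matter of convention.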


Q.5. What do you mean by pooling layer ? Explain.

Ans. Another important concept of convolutional networks is pooling, which is a form of non-linear downsampling. The pooling layer partitions the input volume into a set of non-overlapping rectangles and, for each subregion, outputs the maximum activation, hence the name max-pooling as shown in fig. 3.2. Another common pooling operation is average pooling, which computes the mean of the activations in the previous layer rather than the maximum. The function of the pooling layer is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. It is common to periodically insert a pooling layer in between successive convolutional layers.

(a) Global (b) Local

Fig. 3.2 Max-pooling Layer

During the forward pass, the maximum of non-overlapping regions of the previous activations is computed as in

$$x^{(l)} = \max_{u,v}\; x^{(l-1)}_{u,v}$$
where $u$ and $v$ denote the spatial extent of the non-overlapping regions in width and height.

Since the pooling layer does not have any learnable parameters, the backward pass is merely an upsampling operation of the upstream derivatives. In case of the max-pooling operation, it is common practice to keep track of the index of the maximum activation so that the gradient can be routed towards its origin during back propagation.

The hyperparameters of the pooling layer are its stride and filter size. Since they have to be chosen in accordance to each other, they can be interpreted as the amount of downsampling to be performed.

Q.6. What is fully connected layer ? Discuss its limitation.

Ans. The fully connected layer is a synonym often used in the convolutional network literature and is equivalent to a hidden layer in a regular artificial neural network. It is sometimes referred to as a linear layer. The fully connected layer is responsible for the high-level reasoning in a convolutional neural network and is therefore typically inserted after the convolutional layers. The neurons have full connections to all activations in the previous layer, as in a regular artificial neural network as shown in fig. 3.3.

The output of a fully connected layer can be computed with a dot product of the weights and the previous layer activations, followed by a bias offset as in

$$x^{(l)} = (W^{(l)})^{T} x^{(l-1)} + b^{(l)}$$

During the backward pass, the gradient with respect to the weights and biases is computed as in

$$\nabla_{W^{(l)}} L = x^{(l-1)} \left(\nabla_{x^{(l)}} L\right)^{T}, \qquad \nabla_{b^{(l)}} L = \nabla_{x^{(l)}} L$$

where $\nabla_{x^{(l)}} L$ denotes the upstream derivatives.

The only hyperparameter of the fully connected layer is the number of output neurons, i.e. how many learnable parameters connect the input to the output.

(a) One Hidden Layer (b) Two Hidden Layers

Fig. 3.3 Fully Connected Layers in an Artificial Neural Network

The main limitation of the fully connected layer is the assumption that each input feature, i.e. pixel in the image, is completely independent of neighbouring pixels and contributes equally to the predictive performance. However, pixels that are close together have the tendency to be highly correlated, and thus the spatial structure of images has to be taken into account. Additionally, fully connected layers do not scale well to high dimensional data such as images, since each pixel of the input has to be connected to the layer's output with a learnable parameter.

Q.7. What do you mean by loss layer ?

Ans. The loss layer is one of the essential parts of a neural network. In a supervised learning context, our network computes a prediction for each input image, which we in turn compare to its corresponding ground truth label. We measure the "unhappiness" with the predictions that our network produced using the current parameters with a loss function, also called cost function, objective function, or criterion. In this context, it is useful to regard the output from the last fully connected layer as scores denoted by $s$. For the $i$-th training example, we thus have a ground truth label $y_i$ and the corresponding score $s_i$.

The loss is determined by the parameters $\theta$. In a two dimensional visualization of the loss, we are trying to move from the areas with a high loss (red) into the areas with low loss (blue) by incrementally adjusting the parameters using back propagation. It is important to note that the global minimum might not be reached and that there may exist several local minima.
Q.8. Explain the term dense layer.

Ans. The dense layer uses the linear set of connections between input and output. All units of a dense layer are fully connected to all units in neighbouring layers in such a way that the units take every output from all units in the former dense layer as their input and pass their output to all units in the following layer.

This leads to a very simple matrix-vector computation, but it also leads to a very large set of trainable parameters. A neural network that consists of dense layers can be very successful for low dimensional data, but computation can become very expensive for high dimensional data like images. To handle such tasks, layers with weight sharing are introduced.

Fig. 3.4 A Densely Connected Neural Net with Two Hidden Layers
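The matrix-vector computation of a dense layer, and the reason its parameter count grows quickly, can be sketched as follows. The function names, the row-per-output-unit layout of `W` and the example sizes are assumptions for illustration; no activation function is applied.

```python
def dense_forward(x, W, b):
    """Dense layer output: y[j] = sum_i W[j][i] * x[i] + b[j].

    W holds one row of weights per output unit (layout chosen for this sketch).
    """
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def dense_parameter_count(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs
```

For a flattened 100 × 100 image feeding only 10 output units, the count already exceeds one hundred thousand trainable parameters, which illustrates why weight sharing is introduced for images.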


Q.9. Explain the term densely connected network (DenseNet).

Ans. Densely connected network (DenseNet) consists of densely connected CNN layers, in which the outputs of each layer are connected with all successor layers in a dense block. Therefore, it is formed with dense connectivity between the layers, which gives it the name DenseNet. This concept is efficient for feature reuse, which dramatically reduces network parameters. DenseNet consists of several dense blocks and transition blocks, which are placed between two adjacent dense blocks. The conceptual diagram of a dense block is shown in fig. 3.5.

Fig. 3.5 A 4-layer Dense Block with a Growth Rate of k = 3

Each layer takes all the preceding feature maps as input, as fig. 3.5 illustrates schematically. Consequently, the $l$-th layer receives all the feature maps from the previous layers $x_0, x_1, x_2, \ldots, x_{l-1}$ as input —

$$x_l = H_l([x_0, x_1, x_2, \ldots, x_{l-1}])$$

where $[x_0, x_1, x_2, \ldots, x_{l-1}]$ are the concatenated features for layers $0, \ldots, l-1$, considered as a single tensor. $H_l(\cdot)$ performs three different consecutive operations, namely batch normalization (BN), followed by a ReLU and a 3 × 3 convolution operation. In the transition block, 1 × 1 convolutional operations are performed with BN followed by a 2 × 2 average pooling layer. This model shows state-of-the-art accuracy with a reasonable number of network parameters for object recognition tasks.
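Because every layer in a dense block receives the concatenation of all preceding feature maps, the number of input channels grows linearly with depth. The sketch below assumes that each layer produces exactly k new feature maps (the growth rate); the function name and the initial channel count of 4 are hypothetical values for illustration.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Number of input feature maps seen by each layer of a dense block.

    Layer l receives the block input plus the growth_rate maps
    produced by each of the l layers before it.
    """
    return [initial_channels + growth_rate * l for l in range(num_layers)]

def dense_block_output_channels(initial_channels, growth_rate, num_layers):
    """Feature maps leaving the block: the input plus all newly produced maps."""
    return initial_channels + growth_rate * num_layers
```

With a growth rate of 3 and four layers, as in the block of fig. 3.5, each layer adds only three maps, yet the last layer still sees every map produced before it.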


Q.10. Explain the following CNN architectures — (i) LeNet (ii) AlexNet (iii) GoogLeNet.

Ans. (i) LeNet — Although LeNet was proposed in the 1990s, limited computation capability and memory capacity made the algorithm difficult to implement until about 2010. LeCun, however, proposed CNNs with the back propagation algorithm and experimented on a handwritten digit dataset to achieve state-of-the-art accuracy. The proposed CNN architecture is well-known as LeNet-5. The basic configuration of LeNet-5 is as follows (see fig. 3.6) — two convolution (conv) layers, two sub-sampling layers, two fully connected layers, and an output layer with the Gaussian connection. The total number of weights and multiply and accumulates (MACs) are 431 k and 2.3 M, respectively. As computational hardware started improving in capability, CNNs started becoming popular as an effective learning approach in the computer vision and machine learning communities.

Fig. 3.6 The Architecture of LeNet (Layers #1 to #6, including Subsampling (6@14x14), Convolution (16@10x10) and Output (10))
(ii) AlexNet — Alex Krizhevsky proposed a deeper and wider CNN model compared to LeNet and won the most difficult ImageNet challenge for visual object recognition, called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in 2012. AlexNet achieved state-of-the-art recognition accuracy against all the traditional machine learning and computer vision approaches. It was a significant breakthrough in the field of machine learning and computer vision for visual recognition and classification tasks and is the point in history where interest in deep learning increased rapidly.

The architecture of AlexNet is shown in fig. 3.7. The first convolutional layer performs convolution and max-pooling with local response normalization (LRN), where 96 different receptive filters are used that are 11 × 11 in size. The max pooling operations are performed with 3 × 3 filters, with a stride size of 2. The same operations are performed in the second layer with 5 × 5 filters. 3 × 3 filters are used in the third, fourth, and fifth convolutional layers with 384, 384, and 256 feature maps respectively. Two fully connected (FC) layers are used with dropout, followed by a Softmax layer at the end. Two networks with similar structure and the same number of feature maps are trained in parallel for this model. Two new concepts, local response normalization (LRN) and dropout, are introduced in this network. LRN can be applied in two different ways. First, it can be applied on a single channel or feature map, where an N × N patch is selected from the same feature map and normalized based on the neighbourhood values. Second, LRN can be applied across the channels or feature maps (neighbourhood along the third dimension but a single pixel or location).

Layer 1 : 96 (Conv., MXP, LRN)
Layer 2 : 256
Layer 3 : 384
Layer 4 : 384
Layer 5 : 256
Layer 6 : 4096
Layer 7 : 4096
Soft-max

Fig. 3.7 The Architecture of AlexNet — Convolution, Max-pooling, Local Response Normalization (LRN) and Fully Connected (FC) Layer

(iii) GoogLeNet — GoogLeNet was proposed by Christian Szegedy of Google in 2014 with the objective of reducing computation complexity compared to the traditional CNN. This method was to incorporate inception layers that had variable receptive fields, which were created by different kernel sizes. These receptive fields created operations that captured sparse correlation patterns.

Fig. 3.8 Inception Layer — Naive Version

The initial concept of the inception layer can be seen in fig. 3.8. GoogLeNet improved state-of-the-art recognition accuracy using a stack of inception layers, as seen in fig. 3.9. The difference between the naive inception layer and the final inception layer was the addition of 1 × 1 convolution kernels. These kernels allowed for dimensionality reduction before computationally expensive layers. GoogLeNet consisted of 22 layers in total, which was far greater than any network before it. However, the number of network parameters GoogLeNet used was much lower than that of its predecessors AlexNet or VGG. GoogLeNet had 7M network parameters when AlexNet had 60M and VGG-19 138M. The computations for GoogLeNet also were 1.53G MACs, far lower than that of AlexNet or VGG.

Fig. 3.9 Inception Layer with Dimension Reduction
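The saving from placing a 1 × 1 convolution before an expensive 5 × 5 convolution can be estimated by counting multiply-accumulates (MACs). The feature map size (28 × 28), the channel counts (192 in, 16 after reduction, 32 out) and the function names below are hypothetical values chosen for illustration; stride 1 with size-preserving padding is assumed.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """MACs of a stride-1 convolution that preserves the spatial size."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size

def macs_naive(h, w, c_in, c_out):
    """5 x 5 convolution applied directly to all input channels."""
    return conv_macs(h, w, c_in, c_out, 5)

def macs_reduced(h, w, c_in, c_mid, c_out):
    """1 x 1 convolution down to c_mid channels, followed by the 5 x 5 convolution."""
    return conv_macs(h, w, c_in, c_mid, 1) + conv_macs(h, w, c_mid, c_out, 5)

naive = macs_naive(28, 28, 192, 32)
reduced = macs_reduced(28, 28, 192, 16, 32)
```

Under these assumed sizes the reduced variant needs roughly a tenth of the computation, which is the motivation given in the text for the dimension-reduction version of the inception layer.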
INCEPTION NETWORK, INPUT CHANNELS, TRANSFER LEARNING, ONE SHOT LEARNING, DIMENSION REDUCTIONS, IMPLEMENTATION OF CNN LIKE TENSOR FLOW, KERAS ETC.


Q.11. Write short note on inception network.

Ans. It is often difficult to determine the best filter sizes for your network, and whether to use pooling layers. To overcome this, inception architectures use many different filter sizes and pooling layers in parallel (an inception block), the outputs of which are concatenated and passed to the next block. In this way the network chooses which filter sizes or combinations thereof to use. To solve the problem of a large increase in computational cost, the inception networks utilize 1 × 1 convolutions to shrink the volume of the next layer. This network architecture was introduced by Szegedy et al. to make a network deeper and wider, hence more powerful, while keeping the computational cost low. The inception network could thus go very deep and, like ResNet, utilizes intermediate normalization layers to avoid vanishing and exploding gradients.


Q.12. Define the term input channels.

Ans. In order to generate several inputs for the CNNs ensemble, we compute three input channels from the original RGB proposals. In total there are four input channels, denoted by $x_t \in \mathbb{R}^{H \times W \times D}$ for $t \in T = \{RGB, GH, GM, LUV\}$, which correspond to RGB, normalized sum of gradient histograms for six orientations, gradient magnitude (computed from each channel in RGB) and LUV. Since, originally, the normalized sum of gradient histograms for six orientations only has one channel in the third dimension, we replicate this channel to obtain the remaining depth dimensions.


Q.13. What do you understand by transfer learning ?

Ans. Transfer learning is an essential ingredient for artificial intelligence, where knowledge learned in one domain for some task can be transferred to another domain for a different task. In the context of deep learning, transfer learning is commonly implemented by pre-training a deep neural network, e.g., a CNN, on a large labeled dataset followed by fine-tuning (continued training) on a different dataset of interest, which is usually smaller. This approach has been successfully applied in many areas of computer vision.

It is well known that the initial layers of a CNN always learn to extract simple generic features (e.g., edges, corners, curves, color blobs, etc.), which are applicable to all types of images, whereas the final layers represent highly abstract and data-specific features. Therefore, using a model pre-trained on a large dataset and then fine-tuning it on a different dataset is expected to produce higher accuracy and a faster training process, compared to training the same model from scratch, due to the shared features present in the initial layers.

To demonstrate the power of transfer learning for classifying glitches, we selected some of the most popular CNN models for object recognition, i.e., Inception (versions 2 and 3), ResNet, and VGG. These CNNs were optimally designed and trained on a large dataset of images called ImageNet, which contains 1.2 million images of real-world objects belonging to 1000 categories, with multiple GPUs by other research groups. We obtained the pre-trained weights from these models, and used them to initialize the CNNs before fine-tuning each model on the dataset of glitches.

We show that transfer learning allows us to apply these powerful convolutional networks for classifying glitches using a very small training dataset, to obtain significantly higher accuracies, reduce the training time by several orders of magnitude, and eliminate the need for optimizing hyper-parameters. Furthermore, we show how information from multiple duration spectrograms can be efficiently encoded into a single image in different color channels to enhance information provided to these CNN models.

Q.14. Write short note on one shot learning.

Ans. Neural networks classically contain a single data path from input to output. This is because neural networks are often trained to perform a single task, and this task does not change during the inference stage. If the task changes during inference, the neural network would need to be retrained in order to perform well. Often it is not an option to retrain a neural network for every new task. This is because there might be a lack of training data, or the inference platform is not powerful enough to perform SGD.

The problem of using very little training data, or even only one example, is called one shot learning. Tracking is in essence a one shot learning problem, since the algorithm is provided only one example of the target and asked to track it over multiple frames.

More recent work utilizes a neural Turing machine to perform one shot learning. This machine consists of a controller, such as a feedforward or recurrent neural network, that interacts with an external memory. The machine has long-term storage in the network weights and short-term storage in the form of the aforementioned memory. This structure achieved better results than a human on a one shot problem and was a big step forward at the time. The "matching networks" approach showed that, besides designing for one shot learning, results can be improved even further.
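The idea of classifying from a single example per class can be sketched with a nearest-neighbour rule. This is only an illustration under simple assumptions (inputs are already feature vectors, and Euclidean distance is a sensible similarity); it is not the neural Turing machine or matching-network method, and the names `one_shot_classify` and `support` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_shot_classify(support, query):
    """Return the label whose single stored example is closest to the query.

    support: dict mapping each label to exactly one example vector.
    """
    return min(support, key=lambda label: euclidean(support[label], query))


# One example per class is all that is stored (hypothetical toy data).
support = {"circle": [1.0, 0.0], "square": [0.0, 1.0]}
print(one_shot_classify(support, [0.9, 0.2]))  # circle
```

No retraining is needed when a new class appears: adding one more entry to `support` is enough, which is the property that makes this setting attractive when SGD is not available at inference time.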


Q.15. What do you understand by dimensionality reduction ? Explain any two methods of dimensionality reduction.

Ans. In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, the data reduction is called lossy. There are several well-tuned algorithms for string compression. Although they are typically lossless, they allow only limited manipulation of the data. Two popular and effective methods of dimensionality reduction are as follows -

(i) Wavelet Transforms - The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, $X = (x_1, x_2, \ldots, x_n)$.

The usefulness of this technique lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. For example, all wavelet coefficients larger than some user-specified threshold can be retained. All other coefficients are set to 0. The resulting data representation is therefore very sparse, so that operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space. The technique also works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well. Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.

The DWT is similar to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression. Hence, for an equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail. There is only one DFT, yet there are several families of DWTs. Popular wavelet transforms include the Haar-2, Daubechies-4 and Daubechies-6 transforms. Wavelet transforms can be applied to multidimensional data, such as a data cube.

(ii) Principal Component Analysis - Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where $k \le n$. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.

PCA is computationally inexpensive, can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.

Q.16. Explain the implementation of convolution neural network (CNN).

Ans. The implementation of convolution neural network is as follows -

(i) TensorFlow - TensorFlow (TF) is a library that can be leveraged to build DNNs. It is based on a directed acyclic graph (DAG) that contains nodes, which represent mathematical operations, and edges, which represent tensors (i.e., multidimensional data arrays). This computational graph provides a high level of abstraction to represent the algebraic computation of the constructed deep neural network models, independently of the programming language used, the execution environment, and the target hardware. Thus, it provides TensorFlow users with the flexibility to execute their computation on one or more CPUs/GPUs, or even mobile devices.

TF uses a concept known as deferred execution or lazy evaluation. This means that there are two principal phases in a TensorFlow program -

(a) A construction phase, that assembles a graph, including variables and operations.

(b) An execution phase that uses a session to execute operations and evaluate results in the graph.

The constructed DAG allows advanced compilers like XLA to generate optimized and faster code for any targeted deployment environment. Indeed, the TF execution engine schedules the operations in a way that ensures the efficient use of resources. Verification routines have been designed on TF because of its high popularity in the community; the communication interface between a TF program and the TFCheck library is its corresponding DAG.
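The two-phase idea of deferred execution (first build a graph, then run it) can be illustrated with a toy graph in plain Python. This is a minimal sketch, not TensorFlow's API: the `Node` class and the `constant`, `add`, `mul` and `run` helpers are hypothetical names invented for the illustration.

```python
class Node:
    """A graph node: a constant value or an operation on input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, cache=None):
    """Execution phase: evaluate a node after evaluating its inputs."""
    if cache is None:
        cache = {}
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        left, right = (run(n, cache) for n in node.inputs)
        result = left + right if node.op == "add" else left * right
    cache[id(node)] = result
    return result


# Construction phase: only the graph is assembled, nothing is computed.
x = constant(3.0)
y = constant(4.0)
z = add(mul(x, x), y)

# Execution phase: the graph is evaluated on demand.
print(run(z))  # 13.0
```

Because the whole computation exists as a data structure before anything runs, a real engine can inspect, optimize or re-target the graph before execution, which is the benefit the deferred-execution design aims at.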


(ii) Keras - Keras is a high-level neural network API that can run on top of TensorFlow and Theano. Thus, Keras benefits from the advantages of both and provides a higher-level and more intuitive set of abstractions, which make it easy to configure neural networks, regardless of the back-end scientific computing library. The main motivation behind Keras is to enable fast experimentation with deep neural networks and to go from idea to results as quickly as possible. The library consists of numerous implementations of neural network building blocks and tools to make working with image and text data easier.

Keras offers two types of deep neural networks: sequence-based networks (where the inputs flow linearly through the network) and graph-based networks (where the inputs can skip certain layers). Thus, implementing more complex network architectures such as GoogLeNet and SqueezeNet is easy. However, Keras does not provide most state-of-the-art pre-trained models.


Q.1. Explain in detail about recurrent neural network.

Ans. A recurrent neural network (RNN) is an extension of a conventional feedforward neural network, which is able to handle a variable-length sequence input. The RNN handles the variable-length sequence by having a recurrent hidden state whose activation at each time is dependent on that of the previous time.

More formally, given a sequence $x = (x_1, x_2, \ldots, x_T)$, the RNN updates its recurrent hidden state $h_t$ by

$$h_t = \begin{cases} 0, & t = 0 \\ \phi(h_{t-1}, x_t), & \text{otherwise} \end{cases} \qquad \text{...(i)}$$

where $\phi$ is a non-linear function such as the composition of a logistic sigmoid with an affine transformation. Optionally, the RNN may have an output $y = (y_1, y_2, \ldots, y_T)$, which may again be of variable length.

Traditionally, the update of the recurrent hidden state in equation (i) is implemented as -

$$h_t = g(W x_t + U h_{t-1}) \qquad \text{...(ii)}$$

where g is a smooth, bounded function such as a logistic sigmoid function or a hyperbolic tangent function.

A generative RNN outputs a probability distribution over the next element of the sequence, given its current state $h_t$, and this generative model can capture a distribution over sequences of variable length by using a special output symbol to represent the end of the sequence. The sequence probability can be decomposed into

$$p(x_1, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_1, \ldots, x_{T-1})$$

where the last element is a special end-of-sequence value. We model each conditional probability distribution with

$$p(x_t \mid x_1, \ldots, x_{t-1}) = g(h_t)$$

where $h_t$ is from equation (i).
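The recurrent update $h_t = g(W x_t + U h_{t-1})$ can be sketched in a few lines of plain Python. This is a minimal illustration assuming g = tanh and small list-based matrices; the helper names `matvec`, `rnn_step` and `rnn_forward` are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(W, U, x_t, h_prev):
    """One update h_t = tanh(W x_t + U h_{t-1})."""
    a = matvec(W, x_t)
    b = matvec(U, h_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def rnn_forward(W, U, xs):
    """Run the RNN over a sequence of any length, starting from h_0 = 0."""
    h = [0.0] * len(U)
    for x_t in xs:
        h = rnn_step(W, U, x_t, h)
    return h


# A one-unit RNN reading a sequence of two one-dimensional inputs.
print(rnn_forward([[1.0]], [[1.0]], [[1.0], [0.0]]))
```

The same weights W and U are reused at every time step, which is why the loop can process a sequence of any length.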


Q.2. Briefly explain the term long short-term memory.

Ans. The long short-term memory (LSTM) unit was initially proposed by Hochreiter and Schmidhuber. Since then, a number of minor modifications to the original LSTM unit have been made.

Unlike the recurrent unit, which simply computes a weighted sum of the input signal and applies a non-linear function, each j-th LSTM unit maintains a memory $c_t^j$ at time t. The output $h_t^j$, or the activation, of the LSTM unit is

$$h_t^j = o_t^j \tanh(c_t^j)$$

where $o_t^j$ is an output gate that modulates the amount of memory content exposure. The output gate is computed by

$$o_t^j = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t)^j$$

where $\sigma$ is a logistic sigmoid function and $V_o$ is a diagonal matrix.

The memory cell $c_t^j$ is updated by partially forgetting the existing memory and adding a new memory content $\tilde{c}_t^j$ -

$$c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j$$

where the new memory content is

$$\tilde{c}_t^j = \tanh(W_c x_t + U_c h_{t-1})^j$$

The extent to which the existing memory is forgotten is modulated by a forget gate $f_t^j$, and the degree to which the new memory content is added to the memory cell is modulated by an input gate $i_t^j$. Gates are computed by

$$f_t^j = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1})^j$$

$$i_t^j = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1})^j$$

Note that $V_f$ and $V_i$ are diagonal matrices.

Unlike the traditional recurrent unit, which overwrites its content at each time-step (equation (ii) in Q.1), an LSTM unit is able to decide whether to keep the existing memory via the introduced gates. Intuitively, if the LSTM unit detects an important feature from an input sequence at an early stage, it easily carries this information (the existence of the feature) over a long distance, hence capturing potential long-distance dependencies.

Here, i, f and o are the input, forget and output gates, respectively. c and $\tilde{c}$ denote the memory cell and the new memory cell content. The long short-term memory is shown in fig. 4.1.

Fig. 4.1 Long Short-term Memory

Q.3. What do you mean by gated recurrent unit ? Explain.

Ans. A gated recurrent unit (GRU) was proposed by Cho et al. to make each recurrent unit adaptively capture dependencies of different time scales. Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, however, without having separate memory cells.

The activation $h_t^j$ of the GRU at time t is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$ -

$$h_t^j = (1 - z_t^j)\, h_{t-1}^j + z_t^j\, \tilde{h}_t^j \qquad \text{...(i)}$$

where an update gate $z_t^j$ decides how much the unit updates its activation, or content. The update gate is computed by

$$z_t^j = \sigma(W_z x_t + U_z h_{t-1})^j$$

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. The GRU, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the whole state each time.

The candidate activation $\tilde{h}_t^j$ is computed similarly to that of the traditional recurrent unit from equation (ii) in Q.1 -

$$\tilde{h}_t^j = \tanh(W x_t + U (r_t \odot h_{t-1}))^j$$

where $r_t$ is a set of reset gates and $\odot$ is an element-wise multiplication. When off ($r_t^j$ close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.

The reset gate $r_t^j$ is computed similarly to the update gate -

$$r_t^j = \sigma(W_r x_t + U_r h_{t-1})^j$$

Here, r and z are the reset and update gates, and h and $\tilde{h}$ are the activation and the candidate activation. The gated recurrent unit is shown in fig. 4.2.

Fig. 4.2 Gated Recurrent Unit
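One GRU time step, following the update gate, reset gate, candidate activation and linear interpolation described for the GRU, can be sketched as below. This is a minimal list-based illustration without bias terms; the parameter dictionary keys (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) and helper names are assumptions made for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vec_add(a, b):
    return [p + q for p, q in zip(a, b)]


def gru_step(params, x_t, h_prev):
    """Compute h_t from x_t and h_{t-1} for a layer of GRU units."""
    # Update gate z_t and reset gate r_t.
    z = [sigmoid(v) for v in vec_add(matvec(params["Wz"], x_t),
                                     matvec(params["Uz"], h_prev))]
    r = [sigmoid(v) for v in vec_add(matvec(params["Wr"], x_t),
                                     matvec(params["Ur"], h_prev))]
    # Candidate activation uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in vec_add(matvec(params["W"], x_t),
                                            matvec(params["U"], gated))]
    # Linear interpolation between previous and candidate activation.
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# With all-zero weights both gates equal 0.5 and the candidate is 0.
zero = [[0.0]]
params = {"Wz": zero, "Uz": zero, "Wr": zero, "Ur": zero, "W": zero, "U": zero}
print(gru_step(params, [1.0], [2.0]))  # [1.0]
```

In the zero-weight example the update gate is 0.5, so the new state is exactly half of the previous state, which makes the interpolation easy to check by hand.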


Q.4. How can we use SMT to find synonyms ?

Ans. In statistical machine translation (SMT), the word "ship" in a particular context can be translated into another language the same way that the word "transport" can be. In that context, the word "ship" is synonymous with the word "transport". So, an example query such as "how to ship a box" might have the same translation as "how to transport a box".

The search might be expanded to include both queries - "how to ship a box" as well as "how to transport a box".

A machine translation system may also collect information about words in the same language, to learn about how those words might be related.
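The idea that two words are synonym candidates when they share a translation in the same context can be sketched as follows. The translation table, its French target words and the function names are hypothetical illustrations, not the output of a real SMT system.

```python
from collections import defaultdict


def synonym_candidates(translations):
    """Group source words that share a translation in the same context.

    translations: iterable of (source_word, context, target_word) triples.
    """
    by_target = defaultdict(set)
    for source, context, target in translations:
        by_target[(context, target)].add(source)
    groups = [sorted(words) for words in by_target.values() if len(words) > 1]
    return sorted(groups)


def expand_query(query, groups):
    """Return the query plus variants with a word swapped for its synonyms."""
    variants = [query]
    words = query.split()
    for group in groups:
        for i, word in enumerate(words):
            if word in group:
                for other in group:
                    if other != word:
                        variants.append(" ".join(words[:i] + [other] + words[i + 1:]))
    return variants


# Hypothetical translation table: (word, context, translation).
table = [
    ("ship", "send a box", "envoyer"),
    ("transport", "send a box", "envoyer"),
    ("ship", "boat", "navire"),
]
groups = synonym_candidates(table)
print(groups)
print(expand_query("how to ship a box", groups))
```

Keeping the context in the grouping key matters: "ship" as a boat does not become a synonym of "transport", only "ship" in the sending sense does.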
Q.5. Write short note on beam search.

Ans. Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion, retaining only the top-B candidates, resulting in sequences that differ only slightly from each other.

The most prevalent method for approximate decoding is BS, which stores the top-B highly scoring candidates at each time step, where B is known as the beam width. Let us denote the set of B solutions held by BS at the start of time t as $Y_{[t-1]} = \{y_{1,[t-1]}, \ldots, y_{B,[t-1]}\}$. At each time step, BS considers all possible single token extensions of these beams, given by the set $\mathcal{Y}_t = Y_{[t-1]} \times \mathcal{V}$ (where $\mathcal{V}$ is the vocabulary), and selects the B most likely extensions. More formally, at each step,

$$Y_{[t]} = \underset{y_{1,[t]}, \ldots, y_{B,[t]} \in \mathcal{Y}_t}{\arg\max} \sum_{b \in [B]} \Theta(y_{b,[t]}) \quad \text{s.t. } y_{i,[t]} \ne y_{j,[t]} \;\; \forall\, i \ne j$$

where $\Theta(\cdot)$ is the log-probability of a partial sequence.

The above objective can be trivially maximized by sorting all $B \times |\mathcal{V}|$ members of $\mathcal{Y}_t$ by their log-probabilities and selecting the top B. This process is repeated until time T, and the most likely sequence is selected by ranking the B beams based on log-probabilities.

While this method allows for multiple sequences to be explored in parallel, most completions tend to stem from a single highly valued beam, resulting in outputs that are typically only minor perturbations of a single sequence.
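The sort-and-keep-top-B procedure can be sketched with a toy next-token model. This is a minimal illustration: `toy_model` and its probabilities are invented for the example, and a real decoder would query a trained sequence model instead.

```python
import math


def beam_search(next_log_probs, start, beam_width, steps):
    """Keep the beam_width highest-scoring sequences at every time step.

    next_log_probs(seq) returns a dict {token: log-probability} for the
    next token given the partial sequence seq.
    """
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Sort all single-token extensions and keep the top B.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(seq):
    """Hypothetical model: next-token distribution depends on the last token."""
    table = {
        "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
        "a": {"a": math.log(0.1), "b": math.log(0.9)},
        "b": {"a": math.log(0.3), "b": math.log(0.7)},
    }
    return table[seq[-1]]


for seq, score in beam_search(toy_model, "<s>", beam_width=2, steps=2):
    print(seq, round(math.exp(score), 2))
```

With a beam width of 1 the procedure reduces to greedy decoding; larger widths trade extra computation for a wider search.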


Q.6. Write down the disadvantages of beam search.

Ans. The disadvantages of beam search are as follows -

(i) Computationally Wasteful - The production of near-identical beams makes BS a computationally wasteful algorithm, with essentially the same computation being repeated for no significant gain in performance.

(ii) Loss-evaluation Mismatch - Improvements in posterior probabilities do not necessarily correspond to improvements in task-specific metrics. It is common practice to deliberately throttle BS to become a poorer optimization algorithm by using reduced beam widths. This treatment of an optimization algorithm as a hyper-parameter is not only intellectually dissatisfying but also has a significant practical side-effect: it leads to the decoding of largely bland, generic and "safe" outputs, e.g. always saying "I don't know" in conversation models.

(iii) Lack of Diversity - Most importantly, lack of diversity in the decoded solutions is fundamentally crippling in AI problems with significant ambiguity, e.g. there are multiple ways of describing an image or responding in a conversation that are "correct", and it is important to capture this ambiguity by finding several diverse plausible hypotheses.

Q.7. Explain the term BLEU score.

Ans. BLEU (BiLingual Evaluation Understudy) is an algorithm that was primarily developed to evaluate how accurate machine translated text was. Here, we utilize the same approach to evaluate the quality of the text response from the chatbot for an input text from the user. BLEU is one of the first evaluation metrics to claim a high correlation with the human judgements of quality. BLEU uses an n-gram modeling approach to compare the response text from the chatbot with the reference text in the ground truth test data. Here, for our project, we use parameters from 1-gram to 4-grams to measure how good the responses of our chatbot are.

BLEU score can be calculated either at the sentence level (each input sentence has an individual score) or at the corpus level (a single score for the entire text corpus). Here we implement the sentence-level BLEU approach to get scores for each response text generated by our chatbot. Let us look at an example to understand how exactly the sentence-level BLEU scoring algorithm works with n-gram parameters from 1 to 4 in table 4.1 -

[...] go to the school now.
[...] can't come back from the school now.

Table 4.1
| 1-gram | 2-grams | 3-grams | 4-grams |
| 86.67 | 57.73 | 46.03 | 0 |

The score is lower for the higher n-gram models, and in this case it is zero for 4-grams, as no sequence of 4 subsequent words matches between the two sentences. This is the general methodology of how the BLEU score works.

As mentioned earlier, the BLEU score helps us to determine the next step for our model. As depicted in fig. 4.3, the methodology behind using the BLEU score is to improve our model. A low score indicates that the performance may not be as good as expected, and so we need to improve our model and try calculating the BLEU score again. On the other hand, if we are successful in generating a relatively high score, then it is advised to try other evaluation methods (maybe neural evaluation) which will be able to evaluate our system in a much better way.

Fig. 4.3 BLEU Score Methodology
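The n-gram matching at the heart of BLEU can be sketched as a clipped ("modified") n-gram precision. This is only one component of the metric: the full BLEU score also combines the precisions geometrically and applies a brevity penalty, so this sketch is not expected to reproduce the values in table 4.1. The function names and example sentences are assumptions made for the illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
for n in range(1, 5):
    print(n, round(modified_precision(reference, candidate, n), 4))
```

For this pair the precision falls as n grows and reaches zero at 4-grams, because no run of four consecutive words is shared by the two sentences.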


Q.8. What do you mean by attention model ?

Ans. Attention model (AM), first introduced for machine translation, has now become a predominant concept in neural network literature. Attention has become enormously popular within the artificial intelligence (AI) community as an essential component of neural architectures for a remarkably large number of applications in natural language processing, statistical learning, speech and computer vision.

The rapid advancement in modeling attention in neural networks is primarily due to three reasons. First, these models are now the state-of-the-art for multiple tasks such as machine translation, question answering, sentiment analysis, part-of-speech tagging, constituency parsing and dialogue systems. Second, they offer several other advantages beyond improving performance on the main task. They have been extensively used for improving the interpretability of neural networks, which are otherwise considered as black-box models. This is a notable benefit mainly because of growing interest in the fairness, accountability, and transparency of machine learning models in applications that influence human lives. Third, they help overcome some challenges with recurrent neural networks (RNNs), such as performance degradation with increase in length of the input and the computational inefficiencies resulting from sequential processing of input. Therefore, in this work we aim to provide a brief, yet comprehensive survey on attention modeling.


Attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the attention is placed on all source positions or on only a few source positions.

In both cases, the goal at each decoding time step t is to derive a context vector $c_t$ that captures relevant source-side information to help predict the current target word. While these models differ in how the context vector $c_t$ is derived, they share the same subsequent steps.

Specifically, given the target hidden state $h_t$ and the source-side context vector $c_t$, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows -

$$\tilde{h}_t = \tanh(W_c [c_t ; h_t]) \qquad \text{...(i)}$$

The attentional vector $\tilde{h}_t$ is then fed through the softmax layer to produce the predictive distribution formulated as -

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t) \qquad \text{...(ii)}$$


Q.10. Discuss about the global attentional model.

Ans. The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector $c_t$. In this model type, a variable-length alignment vector $a_t$, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state $h_t$ with each source hidden state $\bar{h}_s$.

The global attentional model is shown in fig. 4.4.

Fig. 4.4 Global Attentional Model

Here, at each time step t, the model infers a variable-length alignment weight vector $a_t$ based on the current target state $h_t$ and all source states $\bar{h}_s$. A global context vector $c_t$ is then computed as the weighted average, according to $a_t$, over all the source states. The alignment vector is given by -

$$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))} \qquad \text{...(i)}$$

Here, score is referred to as a content-based function, for which we consider three different alternatives -

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \\ h_t^\top W_a \bar{h}_s & \text{general} \\ v_a^\top \tanh(W_a [h_t ; \bar{h}_s]) & \text{concat} \end{cases}$$

Besides, in our early attempts to build attention-based models, we used a location-based function in which the alignment scores are computed solely from the target hidden state $h_t$ as follows -

$$a_t = \mathrm{softmax}(W_a h_t) \quad \text{location} \qquad \text{...(ii)}$$

Given the alignment vector as weights, the context vector $c_t$ is computed as the weighted average over all the source hidden states.

Q.11. Explain the local attention model.

Ans. A local attentional mechanism chooses to focus only on a small subset of the source positions per target word.

This model takes inspiration from the tradeoff between the soft and hard attentional models proposed for the image caption generation task. The soft attention refers to a global attention approach in which weights are placed "softly" over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.

The local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position $p_t$ for each target word at time t. The context vector $c_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; D is empirically selected. Unlike the global approach, the local alignment vector $a_t$ is now fixed-dimensional. We consider two variants of the model as follows -

(i) Monotonic Alignment (local-m) - We simply set $p_t = t$, assuming that source and target sequences are roughly monotonically aligned. The alignment vector $a_t$ is defined according to the equation

$$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) \qquad \text{...(i)}$$

(ii) Predictive Alignment (local-p) - Instead of assuming monotonic alignments, the model predicts an aligned position as follows -

$$p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t)) \qquad \text{...(ii)}$$

$W_p$ and $v_p$ are the model parameters which will be learned to predict positions. S is the source sentence length. As a result of the sigmoid, $p_t \in [0, S]$. To favour alignment points near $p_t$, we place a Gaussian distribution centred around $p_t$. Specifically, our alignment weights are now defined as -

$$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right) \qquad \text{...(iii)}$$

We use the same align function as in equation (i), and the standard deviation is empirically set as $\sigma = D/2$. Note that $p_t$ is a real number, whereas s is an integer within the window centred at $p_t$.

The model first predicts a single aligned position $p_t$ for the current target word. A window centred around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and those source states $\bar{h}_s$ in the window. The local attention model is shown in fig. 4.5.

Fig. 4.5 Local Attention Model
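The global and local (predictive) alignment computations can be sketched in plain Python. This is a minimal illustration assuming the dot score and list-based vectors; the function names are hypothetical, and the aligned position `p_t` is passed in directly rather than predicted by a learned layer.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_attention(h_t, source_states):
    """Return (alignment weights a_t, context vector c_t) with the dot score."""
    a_t = softmax([dot(h_t, h_s) for h_s in source_states])
    dim = len(source_states[0])
    c_t = [sum(w * h_s[k] for w, h_s in zip(a_t, source_states))
           for k in range(dim)]
    return a_t, c_t


def local_attention_weights(h_t, source_states, p_t, D):
    """Weights over the window [p_t - D, p_t + D], scaled by a Gaussian.

    The Gaussian is centred at p_t with standard deviation D / 2.
    """
    lo = max(0, math.ceil(p_t - D))
    hi = min(len(source_states) - 1, math.floor(p_t + D))
    positions = list(range(lo, hi + 1))
    align = softmax([dot(h_t, source_states[s]) for s in positions])
    sigma = D / 2.0
    return {s: w * math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
            for s, w in zip(positions, align)}


states = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(global_attention([1.0, 0.0], states))
print(local_attention_weights([1.0, 0.0], states, p_t=1.0, D=1))
```

With identical source states the global weights are uniform, while the local weights are largest at the position nearest to `p_t`, which shows the effect of the Gaussian term.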
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j pare additional knowledge 


Q.12. Write down the application of attention model.

Ans. Attention models have become an active area of research because of their intuition, versatility and interpretability. Variants of attention models have been used to address unique characteristics of a diverse set of application domains, e.g. summarization, reading comprehension, language modeling and parsing. We discuss attention modeling in three application domains -

(i) Natural language generation (NLG)

(ii) Classification

(iii) Recommender systems.

(i) Natural Language Generation (NLG) - NLG tasks involve generating natural language text as the output. Some NLG applications that have benefited from incorporating an AM include machine translation (MT), question answering (QA) and multimedia description (MD).

(a) Machine Translation (MT) - MT uses algorithms to translate text or speech from one language to another. Modeling attention in neural techniques for MT allows for better alignment of sentences in different languages, which is a crucial problem in MT. The advantage of the attention model also becomes more apparent while translating longer sentences.

(b) Question Answering (QA) - QA problems have made use of attention to -

(1) Better understand questions by focusing on relevant parts of the question.

(2) Store large amount of information using memory networks to help find answers.

(3) Improve performance in visual QA task by modeling multi-modality in input using co-attention.

(c) Multimedia Description (MD) - MD is the task of generating a natural language text description of a multimedia input sequence, which can be speech, image and video. Similar to QA, here attention performs the function of finding relevant acoustic signals in speech input or relevant parts of the input image to predict the next word in the caption. Attention is also used to exploit the temporal and spatial structures of videos, using multi-level attention for the video captioning task. The lower abstraction level extracts specific regions within a frame, and the higher abstraction level focuses on a small subset of frames selectively.

(ii) Document Classification - Classification problems mainly make use of self attention to build more effective document representations.

(a) Sentiment Analysis - Similarly, in the sentiment analysis task, self attention helps to focus on the words that are important for determining the sentiment of the input. Some approaches to aspect based sentiment classification incorporate additional knowledge of aspect related concepts into the model and use attention to appropriately weigh the concepts apart from the content itself. Sentiment analysis has also seen multiple architectures being used with attention, such as memory networks.

(iii) Recommender Systems - AMs have also been extensively used in recommender systems for user profiling, i.e., assigning attention weights to the interacted items of a user to capture long and short term interests in a more effective manner. This is intuitive because all interactions of a user are not relevant for the recommendation of an item, and a user's interests are transient as well as varied in the long and short time span.

REINFORCEMENT LEARNING, RL-FRAMEWORK, MDP, BELLMAN EQUATIONS, VALUE ITERATION AND POLICY ITERATION, ACTOR-CRITIC MODEL, Q-LEARNING, SARSA

Q.13. What do you mean by reinforcement learning ?

Ans. The machine learning program should be able to assess the goodness of policies and learn from past good action sequences to be able to generate a policy. Such learning methods are called reinforcement learning algorithms. An example is game playing, where a single move by itself is not that important; it is the sequence of right moves that is good. A move is good if it is part of a good game playing policy. Game playing is an important research area in both AI (artificial intelligence) and ML (machine learning). This is because games are easy to describe and, at the same time, they are quite difficult to play well. A game like chess has a small number of rules, but it is very complex because of the large number of possible moves at each state and the large number of moves that a game contains. Once we have good algorithms that can learn to play games well, we can also apply them to applications with more evident economic utility.

A robot navigating in an environment in search of a goal location is another application area of reinforcement learning. At any time, the robot can move in one of a number of directions.

Q.14. Give the types of reinforcement.

Ans. There are two types of reinforcement, as follows -

(i) Positive - Positive reinforcement is defined as when an event, occurring due to a particular behaviour, increases the strength and the frequency of the behaviour. In other words, it has a positive effect on the behaviour.

Advantages of reinforcement learning are as follows -

(a) Maximizes performance

(b) Sustains change for a long period of time

Disadvantages of reinforcement learning are as follows -

Too much reinforcement can lead to an overload of states, which can diminish the results.

(ii) Negative - Negative reinforcement is defined as the strengthening of a behaviour because a negative condition is stopped or avoided.

Advantages of reinforcement learning are as follows -

(a) Increases behaviour

(b) Provides defiance to a minimum standard of performance

Disadvantages of reinforcement learning are as follows -

It only provides enough to meet the minimum behaviour.


Q.15. What are the differences between reinforcement learning and supervised learning ?

Ans. Differences between reinforcement learning and supervised learning are as follows -

| Reinforcement Learning | Supervised Learning |
| Reinforcement learning is all about making decisions sequentially. In simple words, we can say that the output depends on the state of the current input, and the next input depends on the output of the previous input. | In supervised learning the decision is made on the initial input or the input given at the start. |
| In reinforcement learning the decisions are dependent, so we give labels to sequences of dependent decisions. | In supervised learning the decisions are independent of each other, so labels are given to each decision. |
| Example - Chess game. | Example - Object recognition. |

Q.16. Write the various practical applications of reinforcement learning.

Ans. Various practical applications of reinforcement learning are as follows -

(i) Reinforcement learning can be used in robotics for industrial automation.

(ii) Reinforcement learning can be used in machine learning and data processing.

(iii) Reinforcement learning can be used to create training systems that provide custom instruction and materials according to the requirement of students.

(iv) Reinforcement learning can be used in large environments in the following situations -

(a) A model of the environment is known, but an analytic solution is not available.

(b) Only a simulation model of the environment is given (the subject of simulation-based optimization).

(c) The only way to collect information about the environment is to interact with it.


Q.17. Explain the term Markov decision process (MDP).

Ans. Markov decision processes model time-discrete stochastic state-transition automata. An MDP = (S, A, P, R) consists of a set of states S, a set of actions A, the expected (immediate) rewards $R^{a}_{ss'}$ received at the transition from state s to state s' by executing action a, and transition probabilities P. The probability that in state s action a takes the agent to state s' is given by $P^{a}_{ss'}$. At every point in time t, the MDP is in some state $s_t$. An agent chooses an action $a_t \in A$ which causes a transition from state $s_t$ to some successor state $s_{t+1}$ with probability $P^{a_t}_{s_t s_{t+1}}$. The agent receives a scalar reward (or punishment) $r_t \in \mathbb{R}$ for choosing action $a_t$ in state $s_t$. The Markov property requires that the probabilities of arriving in a state $s_{t+1}$ and receiving a reward $r_t$ depend only on the state $s_t$ and the action $a_t$. They are independent of previous states, actions, and rewards (i.e., independent of $s_{t'}$, $a_{t'}$, $r_{t'}$ for $t' < t$).

The agent that interacts with the MDP is modeled in terms of a policy. A deterministic policy $\pi : S \to A$ is a mapping from the set of states to the set of actions. Applying $\pi$ means always selecting action $\pi(s)$ in state s. This is a special case of a stochastic policy, which specifies a probability distribution over A for each state s, where $\pi(s, a)$ denotes the probability to choose action a in state s.
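The definition above can be made concrete with a small sketch. The two-state MDP below, including the state names, action names, probabilities and rewards, is a made-up illustration and not taken from the text; it only shows how S, A, P, R and a deterministic policy might be stored, and that sampling the next state uses nothing but the current state and action.

```python
import random

# Hypothetical two-state MDP; every name and number here is an illustrative assumption.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[s][a] maps each successor state s' to its transition probability.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

# A deterministic policy is a plain mapping from states to actions.
policy = {"s0": "move", "s1": "stay"}


def step(state, action, rng):
    """Sample s' from P(. | s, a); only (s, a) is used (Markov property)."""
    successors = list(P[state][action].keys())
    weights = list(P[state][action].values())
    next_state = rng.choices(successors, weights=weights, k=1)[0]
    return next_state, R[state][action]


def rollout(start, horizon, seed=0):
    """Follow the deterministic policy for `horizon` steps."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(horizon):
        action = policy[state]
        next_state, reward = step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

Calling `rollout("s0", 5)` returns five (state, action, reward, next state) tuples; a stochastic policy could be stored the same way as P, with a probability for each action in each state.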


Q.18. Explain the term Bellman equation.

Ans. The Bellman equation is important for the development of the formalism. Consider the case of finding the optimal policy for an MDP when the agent hypothetically knows the model of the environment, which means knowledge of the reward and transition functions. If the agent starts in some state s and executes the optimal policy $\mu(s, \theta^*)$, the infinite discounted sum of collected reward

$$V(s, \theta^*) = \max_{\theta} E_{\theta}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right]$$

is called the optimal value of the state s. This function is the unique solution to the Bellman equation:

$$V(s, \theta^*) = \max_{a \in A}\left[\rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s', \theta^*)\right], \quad \forall s \in S \qquad ...(i)$$

The intuition behind this equation is that the optimal value of the state is the sum of the immediate reward for the optimal action in this state and the expected discounted value of the state reached at the next step. Note that if the agent was provided with the optimal value function, it would retrieve the optimal policy $\mu(s, \theta^*)$ by

$$\mu(s, \theta^*) = \arg\max_{a \in A}\left[\rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s', \theta^*)\right], \quad \forall s \in S \qquad ...(ii)$$

Q.19. Briefly describe the terms value iteration and policy iteration.

Ans. Value Iteration: To find the optimal policy, we can use the optimal value function, and there is an iterative algorithm called value iteration that has been shown to converge to the correct V* values. Its pseudocode is shown in fig. 4.6.

    Initialize V(s) to arbitrary values
    Repeat
        For all s ∈ S
            For all a ∈ A
                Q(s, a) ← E[r | s, a] + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
            V(s) ← max_a Q(s, a)
    Until V(s) converge

    Fig. 4.6

We say that the values converged if the maximum value difference between two iterations is less than a certain threshold δ:

$$\max_{s \in S} \left| V^{(l+1)}(s) - V^{(l)}(s) \right| < \delta$$

where l is the iteration counter. Because we care only about the actions with the maximum value, it is possible that the policy converges to the optimal one before the values converge to their optimal values. Each iteration is $O(|S|^2 |A|)$, but frequently there is only a small number $k < |S|$ of next possible states, so complexity decreases to $O(k |S| |A|)$.
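The value iteration loop of fig. 4.6 can be sketched in a few lines. The toy MDP below (states "A" and "B", actions "stay" and "go", discount 0.5) is an assumption made for illustration, and the sketch updates all states from the previous sweep's values, which is one of several valid ways to schedule the updates.

```python
def value_iteration(states, actions, P, R, gamma=0.9, delta=1e-10):
    """Repeat Bellman backups until the largest change in V falls below delta."""
    V = {s: 0.0 for s in states}  # arbitrary initial values
    while True:
        new_V = {}
        for s in states:
            # Q(s, a) = E[r | s, a] + gamma * sum_s' P(s' | s, a) V(s')
            q_values = [
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            ]
            new_V[s] = max(q_values)
        diff = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if diff < delta:
            return V


def greedy_policy(states, actions, P, R, V, gamma):
    """Pick, in every state, the action with the highest one-step lookahead value."""
    return {
        s: max(
            actions,
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()),
        )
        for s in states
    }


# Hypothetical deterministic MDP: staying in B pays 1 per step, everything else pays 0.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}

V_STAR = value_iteration(STATES, ACTIONS, P, R, gamma=0.5)
PI_STAR = greedy_policy(STATES, ACTIONS, P, R, V_STAR, gamma=0.5)
```

For this toy problem the fixed point can be checked by hand: V(B) = 1 / (1 - 0.5) = 2 and V(A) = 0.5 × V(B) = 1, so the greedy policy is to go from A to B and then stay.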
Policy Iteration: In policy iteration, we store and update the policy rather than doing this indirectly over the values. The policy iteration algorithm is shown in fig. 4.7.

    Initialize a policy π' arbitrarily
    Repeat
        π ← π'
        Compute the values using π by solving the linear equations
            V^π(s) = E[r | s, π(s)] + γ Σ_{s' ∈ S} P(s' | s, π(s)) V^π(s')
        Improve the policy at each state
            π'(s) ← arg max_a ( E[r | s, a] + γ Σ_{s' ∈ S} P(s' | s, a) V^π(s') )
    Until π = π'

    Fig. 4.7

The idea is to start with a policy and improve it repeatedly until there is no change. The value function can be calculated by solving the linear equations. We then check whether we can improve the policy by taking these values into account. This step is guaranteed to improve the policy, and when no improvement is possible, the policy is guaranteed to be optimal. Each iteration of this algorithm takes $O(|A| |S|^2 + |S|^3)$ time, which is more than that of value iteration, but policy iteration needs fewer iterations than value iteration.
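A minimal sketch of the evaluate-then-improve loop follows. Two points are assumptions rather than part of the text: the toy MDP is invented for illustration, and the evaluation step approximates the solution of the linear equations by repeated sweeps instead of solving them directly, which keeps the sketch free of a linear-algebra dependency.

```python
def evaluate_policy(policy, states, P, R, gamma, tol=1e-12):
    """Approximate V^pi by sweeping the fixed-point equation until it stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        diff = 0.0
        for s in states:
            a = policy[s]
            v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            diff = max(diff, abs(v - V[s]))
            V[s] = v
        if diff < tol:
            return V


def policy_iteration(states, actions, P, R, gamma):
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        V = evaluate_policy(policy, states, P, R, gamma)
        # Improvement step: act greedily with respect to the current values.
        improved = {
            s: max(
                actions,
                key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()),
            )
            for s in states
        }
        if improved == policy:  # no change: the policy is optimal
            return policy, V
        policy = improved


# Hypothetical deterministic MDP: staying in B pays 1 per step, everything else pays 0.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}

PI, V_PI = policy_iteration(STATES, ACTIONS, P, R, gamma=0.5)
```

On this example the loop stops after two evaluations: the initial "stay everywhere" policy is improved once to "go from A, stay in B", and the second improvement step changes nothing.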


Q.20. What is Q-learning ?

Ans. Q-learning by Watkins is an example of a model-free algorithm, most popular in machine learning. Analogous to the optimal value, it introduces the optimal state-action value Q*(s, a) as the expected discounted reinforcement received by starting at state s and taking an optimal action a, then continuing according to the optimal policy $\mu(s, \theta^*)$. The optimal state-action value is a solution to the following equation, which is a restatement of the original Bellman equation:

$$Q^*(s, a) = \rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q^*(s', a'), \quad \forall a \in A,\ s \in S \qquad ...(i)$$

Note that $V(s, \theta^*) = \max_{a \in A} Q^*(s, a)$, and therefore an optimal policy is given by $\mu(s, \theta^*) = \arg\max_{a \in A} Q^*(s, a)$, so we can compute $\mu(s, \theta^*)$ from Q* without a model of the environment.

The basic idea of the algorithm is that we maintain the set of values for state-action pairs, which represent the forecast of cumulative reinforcement the agent would get starting at a given state by performing a given action, then following a particular policy from there on. These values guide the choice of action, providing an opportunity to do a feasible local optimization at each step. Given an experience in the world, characterized by starting state s, action a, reward r, resulting state s' and next action a', the state-action pair update rule for Q-learning is

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \qquad ...(ii)$$

While learning, any policy can be executed, as long as on an infinitely long run each action is guaranteed to be taken infinitely often at each state and the learning rate α is decreased appropriately. Under those conditions the estimates Q(s, a) will converge to the optimal values Q*(s, a) with probability one.
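The update rule can be tried on a tiny example. The environment below is a made-up deterministic two-state problem, the behaviour policy picks actions uniformly at random, and the learning rate is held constant; the constant rate is an assumption that is adequate here only because the toy transitions and rewards are deterministic.

```python
import random

# Hypothetical deterministic environment: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"): ("B", 0.0),
    ("B", "stay"): ("B", 1.0),
    ("B", "go"): ("A", 0.0),
}
ACTIONS = ["stay", "go"]


def q_learning(steps=5000, alpha=0.5, gamma=0.5, seed=0):
    rng = random.Random(seed)
    Q = {key: 0.0 for key in TRANSITIONS}
    state = "A"
    for _ in range(steps):
        # Any behaviour policy works as long as every action keeps being tried.
        action = rng.choice(ACTIONS)
        next_state, reward = TRANSITIONS[(state, action)]
        # Off-policy target: use the best action in s', not the one taken next.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q


Q_TABLE = q_learning()
GREEDY = {s: max(ACTIONS, key=lambda a: Q_TABLE[(s, a)]) for s in ("A", "B")}
```

For this environment the optimal values can be derived by hand from equation (i): Q*(B, stay) = 2, Q*(A, go) = 1, and Q*(A, stay) = Q*(B, go) = 0.5, which is what the table approaches even though the agent behaves randomly.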
Q.21. Explain the SARSA algorithm.

Ans. The SARSA algorithm differs from the classical Q-learning algorithm in that, rather than using the maximum Q-value from the resulting state as an estimate of that state's value, it uses the Q-value of the resulting state and the action that was actually chosen in that state. Thus, the values learned are sensitive to the policy being executed.

Given an experience in the world, characterized by starting state s, action a, reward r, resulting state s' and next action a', the state-action pair update rule for SARSA(0) is

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma\, Q(s', a') - Q(s, a)\right] \qquad ...(i)$$

In truly Markov domains, Q-learning is usually the algorithm of choice. In fact, Q-learning (QL) can be shown to fail to converge on very simple non-Markov domains. Policy-sensitivity is often seen as a liability, because it makes issues of exploration more complicated. However, in non-Markov domains, policy-sensitivity is actually an asset. Because observations do not uniquely correspond to underlying states, the value of a policy depends on the distribution of underlying states given a particular observation. But this distribution generally depends on the policy. So, the value of a state, given a policy, can only be evaluated while executing that policy. Note that when SARSA is used in a non-Markovian environment, the symbols s and s' in equation (i) represent sensations, which usually can correspond to several states.

The SARSA algorithm can be augmented by an eligibility trace, to yield the so-called SARSA(λ) algorithm. SARSA(λ) describes a class of algorithms, where an appropriate choice of λ is made depending on the problem. With the parameter λ set to 0, SARSA(λ) becomes regular SARSA. With λ set to 1, it becomes a pure Monte-Carlo method, in which, at the end of every trial, each state-action pair is adjusted toward the cumulative reward received on this trial after the state-action pair occurred. The pure Monte-Carlo algorithm makes no attempt to satisfy the Bellman equation relating the values of subsequent states. Since it is often impossible to satisfy the Bellman equation in partially observable domains, Monte-Carlo is a reasonable choice. Generally, SARSA(λ) with a large value of λ seems to be the most appropriate among the conventional reinforcement-learning algorithms for solving partially observable problems.
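The difference between the SARSA(0) and Q-learning updates is easiest to see on a single experience (s, a, r, s', a'). In the sketch below the Q-table entries, the reward and the step sizes are invented numbers chosen only to make the two targets differ.

```python
def q_learning_target(Q, reward, next_state, actions, gamma):
    """Off-policy target: best available action in the resulting state."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: the action that was actually chosen in the resulting state."""
    return reward + gamma * Q[(next_state, next_action)]


def td_update(q_value, target, alpha):
    """Move the old estimate a fraction alpha toward the target."""
    return q_value + alpha * (target - q_value)


# Hypothetical Q-table and experience; all numbers are illustrative assumptions.
Q = {("s", "a"): 0.0, ("s2", "left"): 1.0, ("s2", "right"): 3.0}
ACTIONS = ["left", "right"]
reward, gamma, alpha = 1.0, 0.9, 0.5
next_state, next_action = "s2", "left"  # the policy happened to pick "left"

q_new = td_update(Q[("s", "a")], q_learning_target(Q, reward, next_state, ACTIONS, gamma), alpha)
sarsa_new = td_update(Q[("s", "a")], sarsa_target(Q, reward, next_state, next_action, gamma), alpha)
```

Here the Q-learning target is 1 + 0.9 × 3 = 3.7 and the SARSA target is 1 + 0.9 × 1 = 1.9, so the updated estimates are about 1.85 and 0.95 respectively. The SARSA value reflects the exploratory choice of "left", which is the policy-sensitivity described above.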
Q.22. Write short note on RL-framework.

Ans. Reinforcement learning (RL) is a mathematical framework for autonomously learning an optimal control strategy through trial and error, in a wide range of fields besides engineering and robotics. The control strategy or policy is a mapping between states and actions, which enables the agent to have knowledge of selecting a good action based upon the current state and its experience from interaction with an environment. This learning process continues until the agent acquires a promising performance, and the whole process is driven by the reward. For obstacle-avoiding robots, the sensory input can be viewed as the state, while the operation of the motor is regarded as the action. Whether a collision occurs or not can be defined as rewards. In this way, the obstacle avoidance task can be formalized as a standard RL process.

Extracting useful information from the environment (forms such as images, text and audio) is a key capability for training the agent. With the recent advances in deep learning (DL), relying on the powerful function approximation and representation learning properties of neural networks allows an RL agent to efficiently learn features and patterns from high-dimensional data with models of multiple processing layers. It has dramatically accelerated the development of RL, and deep reinforcement learning (DRL) could be applied to more fields with minimum knowledge, end to end, by integration of RL and neural networks. There are many successful neural network architectures, such as convolutional neural networks (CNN), multilayer perceptrons, recurrent neural networks, generative adversarial networks (GAN), etc., which dramatically improve the state of the art in applications such as object detection and speech recognition. By extending the RL framework, DRL algorithms can deal with decision-making problems that were previously intractable, i.e., with high-dimensional state input and large action spaces. This is significant progress, and it has become easier to train a complex network model than before.
Ans, Pla. 
the ag eB Ula 


ty € Va) 
lucy the 


ja mapp" 


Bay 


in the term actor-critic algorithm. 


r Policy-gradient methods often exhibit slow convergence due 


vances of the gradient estimates. The actor-critic methods attempt 
Variance fy 


Y adopting a critic network to estimate the value of the 
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current policy, which is then used to update the actor's poli 

direction of performance improvement. The action-sclection ie Pa 
the actor mp : S —> A, which make decisions without the eae k 
procedures on a value function, mapping representation of OF opti 


selection probabilities. The value function is known a DN 


Q; -Sx A > R, which estimates the expected return to reduce Xe ci 
accelerate learning, mapping states to expected cumulative firen any 
Tew 


Fig. 4.8 shows an architecture 
design, the actor and critic are two 
separated networks share a 
common observation. At each 
step, the action selected by actor 
network is also an input factor to 
the critic network. In the process 
of policy improvement, the critic 
network estimates the state-action 
value of the current policy by Fig. 4.8 Actor-critic Network 
DQN, then actor network updates its policy in a direction improves theQ 
value. Compared with the previous pure policy-gradient methods, which do 
not have a value function, using a critic network to evaluate the current polij 
is more conducive to convergence and stability. The better the state-action 
value evaluation is, the lower the learning performance’s variance is. Itis 
important and helpful to have a better policy evaluation in the critic network 
Policy-gradient-based actor-critic algorithms are useful in many real-life 
applications because they can search for optimal policies using low-variance 
gradient estimates. Lillicrap et al. presented the DDPG algorithm, which 
_ combines the actor-critic approach with insights from DQN, tosolve simulated 
© physics tasks and it has been widely used in many robotic control tasks. It uses 
| two neural networks; the actor network learns a deterministic policy and g 
critic network approximates the Q-function of the current policy. g 

# 


Input Layer 
and Feature 
Extraction 

Layers 


SUPPORT VECTOR MACHINES, BAYESIAN LEARNING, APPLICATION OF MACHINE LEARNING IN COMPUTER VISION, SPEECH PROCESSING, NATURAL LANGUAGE PROCESSING, CASE STUDY: IMAGENET COMPETITION

Q.1. What are support vector machines ? Explain.

Ans. Support vector machines are supervised learning models with associated learning algorithms that analyze data and are then used for classification. Classification refers to deciding which images belong to which class or set of categories. In machine learning, classification is considered an instance of supervised learning, which refers to the task of inferring a function from labelled training data. Training data in an image retrieval process can be correctly identified images that are put in a particular class, where each class belongs to a different category of images. In the SVM training algorithm a model is built in which new examples are assigned to one category class or the other. In this model the examples of the separate categories are divided by clear gaps that are as wide as possible.

The idea of SVM is to construct a hyperplane in a high dimensional space which can be used for classification. A hyperplane is a subspace of one dimension less than its ambient space. If the space is 3-dimensional then its hyperplanes are the 2-dimensional planes. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. The separation between the hyperplane and the closest data point is called the margin of separation. The larger the margin of separation, the lower the generalization error of the classifier.

The main objective of the SVM is to find the particular hyperplane for which the margin of separation is maximized. When this condition is met, the decision plane used to differentiate between the two classes is called the optimal hyperplane. The support vectors play an important role in the operation of this class of learning machine. We can define support vectors as the elements of the training data set that would change the position of the dividing hyperplane if they were removed. For an SVM trained with samples from two classes, the samples that lie on the margin of the maximum-margin hyperplane are called support vectors, or we can say that these are the data points that lie closest to the decision surface.


Q.2. Discuss key idea of the support vector machines (SVMs).
(R.G.P.V., Nov. 2013)

Ans. The most widely used state-of-the-art machine learning technique is Support Vector Machine (SVM). It is mainly used for classification. SVM works on the principle of margin calculation. It basically draws margins between the classes. The margins are drawn in such a fashion that the distance between the margin and the classes is maximum, hence minimizing the classification error. The working of SVM is given in fig. 5.1.

Fig. 5.1 Working of Support Vector Machine (diagram; labels: Separating Hyperplane, Support Vectors)

    Input: $S$, $\lambda$, $T$, $k$
    Initialize: choose $w_1$ such that $\|w_1\| \le 1/\sqrt{\lambda}$
    For $t = 1, 2, \ldots, T$
        Choose $A_t \subseteq S$, where $|A_t| = k$
        Set $A_t^{+} = \{(x, y) \in A_t : y\langle w_t, x\rangle < 1\}$
        Set $\eta_t = \dfrac{1}{\lambda t}$
        Set $w_{t+\frac{1}{2}} = (1 - \eta_t \lambda)\, w_t + \dfrac{\eta_t}{k} \sum_{(x, y) \in A_t^{+}} y\, x$
        Set $w_{t+1} = \min\left\{1, \dfrac{1/\sqrt{\lambda}}{\|w_{t+\frac{1}{2}}\|}\right\} w_{t+\frac{1}{2}}$
    Output: $w_{T+1}$

For Example: If we only had two features like height and hair length of an individual, we would first plot these two variables in two dimensional space, where each point has two co-ordinates. The points that end up closest to the separating line are known as support vectors.

Fig. 5.2 Support Vector Machine (scatter plot; x-axis: Height (cm), y-axis: Hair Length (cm))

Now, we will find some line that splits the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups will be farthest away.

Fig. 5.3 Support Vector Machine Classifier (scatter plot with separating line; x-axis: Height (cm), y-axis: Hair Length (cm))

In the scenario shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line can be considered as the classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data.
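The iterative procedure listed in this answer (inputs S, λ, T, k) can be sketched directly. It appears to correspond to a mini-batch sub-gradient solver of the Pegasos type, though the text does not name it, so treat that identification as an assumption. The training points, the regularization value and the omission of a bias term are likewise illustrative choices.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minibatch_svm(samples, lam=0.1, T=200, k=2, seed=0):
    """Follow the listed steps: sample A_t, keep margin violators, step, project."""
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [0.0] * dim                    # w_1 = 0 satisfies ||w_1|| <= 1/sqrt(lam)
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, T + 1):
        batch = rng.sample(samples, k)                         # A_t with |A_t| = k
        violators = [(x, y) for x, y in batch if y * dot(w, x) < 1]  # A_t^+
        eta = 1.0 / (lam * t)                                  # step size eta_t
        w_half = [(1.0 - eta * lam) * wi for wi in w]
        for x, y in violators:
            for j in range(dim):
                w_half[j] += (eta / k) * y * x[j]
        norm = math.sqrt(dot(w_half, w_half))
        scale = min(1.0, radius / norm) if norm > 0 else 1.0   # projection step
        w = [scale * wi for wi in w_half]
    return w


# Hypothetical linearly separable data: (feature vector, label in {+1, -1}).
SAMPLES = [
    ((2.0, 1.0), 1), ((1.0, 2.0), 1), ((2.0, 2.0), 1), ((3.0, 1.0), 1),
    ((-1.0, -2.0), -1), ((-2.0, -1.0), -1), ((-2.0, -2.0), -1), ((-1.0, -3.0), -1),
]

W = minibatch_svm(SAMPLES)
PREDICTIONS = [1 if dot(W, x) > 0 else -1 for x, _ in SAMPLES]
```

On this data every training point ends up on the correct side of the learned hyperplane through the origin. Data that is not centred around the origin would need a bias term, for example by appending a constant feature to each x.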


Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:

(i) We can draw lines/planes at any angles (rather than just horizontal or vertical as in the classic game).

(ii) The objective of the game is to segregate balls of different colors in different rooms.

(iii) And the balls are not moving.

Q.3. Write down the benefits and issues of SVM.

Ans. Benefits: There are a number of benefits of using SVM, as follows:

(i) It is effective in high dimensional space.

(ii) It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

(iii) It is versatile because different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

Issues: The issue to be faced in SVM is that the hyperplane constructed by SVM depends on only a fraction of the training samples, called support vectors (SVs), which lie close to the decision border. Removing training samples that are not support vectors has no effect on building the decision function.

Q.4. Explain about optimal hyperplane.

Ans. Let the M training inputs $x_i$ (i = 1, ..., M) have class labels $y_i = +1$ or $y_i = -1$. If the training data are linearly separable, the decision function is

$$D(x) = w^{T} x + b \qquad ...(i)$$

where w represents an m-dimensional vector, b is a scalar, and for i = 1, ..., M,

$$w^{T} x_i + b \;\begin{cases} \ge 1, & \text{for } y_i = +1 \\ \le -1, & \text{for } y_i = -1 \end{cases} \qquad ...(ii)$$

Equation (ii) is equivalent to

$$y_i (w^{T} x_i + b) \ge 1, \quad \text{for } i = 1, \ldots, M \qquad ...(iii)$$

The hyperplane

$$D(x) = w^{T} x + b = c, \quad \text{for } -1 < c < 1 \qquad ...(iv)$$

forms a separating hyperplane which separates $x_i$ (i = 1, ..., M). When c = 0, the separating hyperplane is in the middle of the two hyperplanes with c = 1 and c = -1. The distance between the separating hyperplane and the training datum nearest to the hyperplane is known as the margin. Considering that the hyperplanes D(x) = 1 and D(x) = -1 include at least one training datum, the hyperplane D(x) = 0 has the maximum margin for -1 < c < 1. The region $\{x \mid -1 \le D(x) \le 1\}$ is the generalization region for the decision function.

Two decision functions which satisfy equation (iii) are shown in fig. 5.4. Hence, there are an infinite number of decision functions which satisfy equation (iii). The generalization ability depends on the location of the separating hyperplane. The hyperplane with the largest margin is known as the optimal separating hyperplane, as shown in fig. 5.4.

Fig. 5.4 Optimal Separating Hyperplane in a Two-dimensional Space (diagram; label: Maximal Margin)

Here, we must keep in mind that the optimal separating hyperplane realizes the highest generalization ability from the standpoint of the VC dimension. The only assumption made is that the training and test data are produced by a single unknown distribution. Hence, when outliers are included in the training data or the training data are biased from the unknown distribution, the optimal separating hyperplane cannot realize the highest generalization ability.

Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum x to the separating hyperplane is $|D(x)| / \|w\|$. Hence, assuming the margin δ, all the training data must satisfy

$$\frac{y_k D(x_k)}{\|w\|} \ge \delta, \quad \text{for } k = 1, \ldots, M \qquad ...(v)$$

If w is a solution, aw is also a solution, where a is a scalar. Therefore, the following constraint is imposed:

$$\delta \|w\| = 1 \qquad ...(vi)$$

From equations (v) and (vi), to find the optimal separating hyperplane, it is needed to obtain w with the minimum Euclidean norm which satisfies equation (iii). Support vectors can be defined as data which satisfy the equality in equation (iii). In fig. 5.4, the data corresponding to the filled circles and the filled rectangle are support vectors. These data are nearest to the separating hyperplanes and hence are difficult to classify.

Now the optimal separating hyperplane can be obtained by minimizing

$$Q(w) = \frac{1}{2} \|w\|^2 \qquad ...(vii)$$

with respect to w and b subject to the constraints

$$y_i (w^{T} x_i + b) \ge 1, \quad \text{for } i = 1, \ldots, M \qquad ...(viii)$$

For the convex optimization problem, the number of variables given by equations (vii) and (viii) is the number of features plus one, m + 1. If the number of features is small, equations (vii) and (viii) can be solved by the quadratic programming method. If the number of features is large, equations (vii) and (viii) can be converted into the equivalent dual problem whose number of variables is the number of training data.


First, the constrained problem given by equations (vii) and (viii) is converted into the unconstrained problem as

$$Q(w, b, \alpha) = \frac{1}{2} w^{T} w - \sum_{i=1}^{M} \alpha_i \left\{ y_i (w^{T} x_i + b) - 1 \right\} \qquad ...(ix)$$

where $\alpha = (\alpha_1, \ldots, \alpha_M)^{T}$ represents the Lagrange multipliers. The optimal solution of equation (ix) is given by the saddle point where equation (ix) is minimized with respect to w and b and maximized with respect to $\alpha_i\ (\ge 0)$. The optimal solution w*, b*, and α* of equation (ix) must satisfy

$$\frac{\partial Q(w^*, b^*, \alpha^*)}{\partial b} = 0 \qquad ...(x)$$

$$\frac{\partial Q(w^*, b^*, \alpha^*)}{\partial w} = 0 \qquad ...(xi)$$

Using equation (ix), equations (x) and (xi) reduce, respectively, to

$$\sum_{i=1}^{M} \alpha_i^* y_i = 0, \quad \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xii)$$

$$w^* = \sum_{i=1}^{M} \alpha_i^* y_i x_i, \quad \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xiii)$$

According to the Kuhn-Tucker theorem, in equation (iii) the equality holds for the training input-output pair $(x_i, y_i)$ only when the associated $\alpha_i^*$ is not 0. In this condition, the training data $x_i$ are the support vectors.
Substituting equations (xii) and (xiii) into equation (ix), the following dual problem is obtained, namely maximize

$$Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \qquad ...(xiv)$$

with respect to α subject to the constraints

$$\sum_{i=1}^{M} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xv)$$

Solving equations (xiv) and (xv) for $\alpha_i$ (i = 1, ..., M), the support vectors for classes one and two can be obtained. Then the optimal hyperplane is placed at equal distances from the support vectors for classes one and two, and b* is given by

$$b^* = -\frac{1}{2} \left( w^{*T} s_1 + w^{*T} s_2 \right) \qquad ...(xvi)$$

where $s_1$ and $s_2$ are arbitrary support vectors for classes one and two respectively.

From equation (xiii), equation (xvi) is rewritten as given below:

$$b^* = -\frac{1}{2} \sum_{k=1}^{M} y_k \alpha_k^* \left( s_1^{T} x_k + s_2^{T} x_k \right) \qquad ...(xvii)$$
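A two-point example makes the dual formulas easy to verify by hand. The data set below, with one point per class, is an assumption chosen so that the dual can be maximized on paper: with both multipliers equal to a, the dual objective becomes 2a - 2a², which is largest at a = 0.5. The sketch then recovers w* and b* from the multipliers using the formulas of equations (xiii) and (xvi).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def recover_hyperplane(alpha, X, y):
    """w* = sum_i alpha_i y_i x_i and b* = -1/2 (w*.s1 + w*.s2)."""
    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][j] for i in range(len(X))) for j in range(dim)]
    # Support vectors are the points whose multiplier is non-zero.
    s1 = next(x for x, yi, a in zip(X, y, alpha) if yi == 1 and a > 0)
    s2 = next(x for x, yi, a in zip(X, y, alpha) if yi == -1 and a > 0)
    b = -0.5 * (dot(w, s1) + dot(w, s2))
    return w, b


# Hypothetical training set: one point per class, symmetric about the origin.
X = [(1.0, 0.0), (-1.0, 0.0)]
Y = [1, -1]
ALPHA = [0.5, 0.5]  # hand-derived maximizer of 2a - 2a^2

W_STAR, B_STAR = recover_hyperplane(ALPHA, X, Y)
MARGINS = [Y[i] * (dot(W_STAR, X[i]) + B_STAR) for i in range(len(X))]
```

The result is w* = (1, 0) and b* = 0, both points satisfy the constraint with equality, and the geometric margin is 1/||w*|| = 1, which matches the picture of two points one unit either side of the line x = 0.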
In the above discussion, we assumed that the training data are linearly separable. If the training data are not linearly separable, we want to determine the optimal hyperplane that achieves both the maximum margin and the minimum classification error.

To permit data that do not have the maximum margin, nonnegative slack variables $\xi_i\ (\ge 0)$ are introduced in equation (iii):

$$y_i (w^{T} x_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1, \ldots, M \qquad ...(xviii)$$

For the training data $x_i$, when $0 < \xi_i < 1$, the data do not have the maximum margin but are still correctly classified. But if $\xi_i \ge 1$, the data are misclassified by the optimal hyperplane. To obtain the optimal hyperplane in which the number of training data which do not have the maximum margin is minimum, we need to minimize

$$Q(w) = \sum_{i=1}^{M} \theta(\xi_i)$$

where

$$\theta(\xi_i) = \begin{cases} 1, & \text{for } \xi_i > 0 \\ 0, & \text{for } \xi_i = 0 \end{cases}$$

But this is a combinatorial optimization problem and is hard to solve. In place of it, we consider minimizing

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i \qquad ...(xix)$$

subject to the constraints

$$y_i (w^{T} x_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1, \ldots, M \qquad ...(xx)$$

where C represents the upper bound which determines the tradeoff between the maximization of margin and minimization of classification error, and is set to a large value. We call the achieved hyperplane the soft margin hyperplane.
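Similar to the linearly separable condition, introducing the Lagrange multipliers α and β, we get

$$Q(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i - \sum_{i=1}^{M} \alpha_i \left( y_i (w^{T} x_i + b) - 1 + \xi_i \right) - \sum_{i=1}^{M} \beta_i \xi_i \qquad ...(xxi)$$

The conditions of optimality are given by

$$\frac{\partial Q(w^*, b^*, \xi^*, \alpha^*, \beta^*)}{\partial b} = 0 \qquad ...(xxii)$$

$$\frac{\partial Q(w^*, b^*, \xi^*, \alpha^*, \beta^*)}{\partial w} = 0 \qquad ...(xxiii)$$

$$\frac{\partial Q(w^*, b^*, \xi^*, \alpha^*, \beta^*)}{\partial \xi} = 0 \qquad ...(xxiv)$$

Using equation (xxi), equations (xxii)-(xxiv) reduce, respectively, to

$$\sum_{i=1}^{M} \alpha_i^* y_i = 0, \quad \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xxv)$$

$$w^* = \sum_{i=1}^{M} \alpha_i^* y_i x_i, \quad \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xxvi)$$

$$\alpha_i^* + \beta_i^* = C, \quad \alpha_i^*, \beta_i^* \ge 0, \quad \text{for } i = 1, \ldots, M \qquad ...(xxvii)$$

Hence, we get the following dual problem. Namely, find $\alpha_i$ (i = 1, ..., M) to maximize

$$Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \qquad ...(xxviii)$$

subject to the constraints

$$\sum_{i=1}^{M} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C \qquad ...(xxix)$$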


This is similar to the linearly separable condition. According to the Kuhn-Tucker condition, the optimal solution satisfies

$$\alpha_i \left( y_i (w^{T} x_i + b) - 1 + \xi_i \right) = 0 \qquad ...(xxx)$$

$$\beta_i \xi_i = (C - \alpha_i)\, \xi_i = 0 \qquad ...(xxxi)$$

Therefore, there are three cases for $\alpha_i$:

(i) $\alpha_i = 0$. Then $\xi_i = 0$. Hence $x_i$ is correctly classified.

(ii) $0 < \alpha_i < C$. Then $y_i (w^{T} x_i + b) - 1 + \xi_i = 0$ and $\xi_i = 0$. Hence $y_i (w^{T} x_i + b) = 1$ and $x_i$ is a support vector.

(iii) $\alpha_i = C$. Then $y_i (w^{T} x_i + b) - 1 + \xi_i = 0$ and $\xi_i \ge 0$. Hence $x_i$ is a support vector, and if $0 \le \xi_i < 1$, $x_i$ is correctly classified, and if $\xi_i \ge 1$, $x_i$ is misclassified.

For separable and nonseparable conditions the decision function is the same and is given by

$$D(x) = \sum_{i=1}^{M} \alpha_i y_i x_i^{T} x + b \qquad ...(xxxii)$$

Since $\alpha_i$ are nonzero only for the support vectors, the summation in equation (xxxii) is added only for the support vectors.

The unknown datum x is classified as follows:

$$x \in \begin{cases} \text{class 1}, & \text{if } D(x) > 0 \\ \text{class 2}, & \text{otherwise} \end{cases}$$

Hence, when training data are separable, the region $\{x \mid 1 > D(x) > -1\}$ is a generalization region.
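The role of the slack variables can be checked numerically. In the sketch below the hyperplane, the three sample points and the value of C are all invented for illustration; for a fixed hyperplane, the smallest slack that satisfies the soft margin constraint is max(0, 1 - y D(x)), and the three ranges of ξ described in the soft margin discussion map to three outcomes.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision(w, b, x):
    """D(x) = w.x + b"""
    return dot(w, x) + b


def slack(w, b, x, y):
    """Smallest xi >= 0 with y * D(x) >= 1 - xi."""
    return max(0.0, 1.0 - y * decision(w, b, x))


def describe(xi):
    if xi == 0.0:
        return "on or outside the margin"
    if xi < 1.0:
        return "inside the margin, correctly classified"
    return "misclassified"


def classify(w, b, x):
    """Class 1 if D(x) > 0, otherwise class 2."""
    return 1 if decision(w, b, x) > 0 else 2


def soft_margin_objective(w, b, samples, C):
    """1/2 ||w||^2 + C * sum of slacks"""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in samples)


# Hypothetical hyperplane and samples; every sample carries the label +1.
W, B = (1.0, 0.0), 0.0
SAMPLES = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-0.5, 0.0), 1)]

SLACKS = [slack(W, B, x, y) for x, y in SAMPLES]
LABELS = [describe(xi) for xi in SLACKS]
OBJECTIVE = soft_margin_objective(W, B, SAMPLES, C=10.0)
```

The slacks come out as 0, 0.5 and 1.5, so the first point respects the margin, the second violates it but is still on the correct side, and the third is misclassified. With C = 10 the objective is 0.5 + 10 × 2 = 20.5, which shows how a large C makes margin violations dominate the cost.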


Q.5. Differentiate between linear and non-linear SVM classification.

Ans. The author has compared linear (LDA) vs. non-linear (SVM) classifiers to classify the EEG signal for five different mental tasks. The brain signal for each task (relaxing, writing a letter, solving puzzles, counting, and rotating objects) has been taken for some seconds, and each task is repeated five times. The comparison between linear and non-linear classifiers based on the five different mental tasks is summarized below.

We can conclude that the non-linear classifier gives better classification results as compared to the linear classifier. A non-linear classifier, specially SVM, gives a better result than the linear classifier (LDA) for the high-dimensional nature of the EEG signal. In a multilayer back propagation network, NN includes the problem of selecting a proper number of hidden units, so it cannot give a proper generalization model. With the help of a genetic algorithm, SVM has overcome that problem. By selecting the subset of features efficiently from the large feature space, the genetic algorithm provides a better generalization approach. As SVM can select a small and efficient subset of features, the number of required electrodes will be less, so the noise will be minimized and accuracy will be improved for all mental tasks. For different window sizes, error rates of the three classifiers are calculated for each mental task. The window size in each trial defines the time-wise classification of the different mental tasks. Error rates of SVM are less than those of the other classifiers due to the efficient use of GA with SVM.

Q.6. Describe margin and hard support vector machines (SVM).

Ans. Let $S = (x_1, y_1), \ldots, (x_m, y_m)$ be a training set of examples, where each $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. We say that this training set is linearly separable if there exists a half space, (w, b), such that $y_i = \mathrm{sign}(\langle w, x_i \rangle + b)$ for all i. Alternatively, this condition can be rewritten as

$$\forall i \in [m], \quad y_i \left( \langle w, x_i \rangle + b \right) > 0$$

All half spaces (w, b) that satisfy this condition are ERM hypotheses (their 0-1 error is zero, which is the minimum possible error). For any separable training sample, there are many ERM half spaces. Which one of them should the learner pick ?

Consider, for example, the training set described in the picture. Several hyperplanes separate the examples, but intuition leads us to prefer the black hyperplane over the others. One way to formalize this intuition is the concept of margin: the margin of a hyperplane with respect to a training set is the minimal distance between a point in the training set and the hyperplane. If a hyperplane has a large margin, then it will still separate the training set even if we slightly perturb each instance.

We will see later on that the true error of a half space can be bounded in terms of the margin it has over the training sample (the larger the margin, the smaller the error), regardless of the Euclidean dimension in which this half space resides.

Hard-SVM is the learning rule in which we return an ERM hyperplane that separates the training set with the largest possible margin. To define Hard-SVM formally, we first express the distance between a point x and a hyperplane using the parameters defining the half space.
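As a small illustration of that last step, the distance from a point x to the hyperplane defined by (w, b) is |⟨w, x⟩ + b| / ||w||, and the margin is the smallest such distance over the training set. The sample points and the two candidate hyperplanes below are assumptions made up for the example; both hyperplanes separate the data, so both are ERM solutions, but they have different margins.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane {z : <w, z> + b = 0}."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def separates(w, b, samples):
    """True when y * (<w, x> + b) > 0 for every training example."""
    return all(y * (dot(w, x) + b) > 0 for x, y in samples)


def margin(w, b, samples):
    """Minimal distance between a training point and the hyperplane."""
    return min(distance_to_hyperplane(w, b, x) for x, _ in samples)


# Hypothetical separable training set.
SAMPLES = [
    ((2.0, 2.0), 1), ((3.0, 1.0), 1),
    ((-2.0, -2.0), -1), ((-1.0, -3.0), -1),
]

HYPERPLANE_A = ((1.0, 1.0), 0.0)  # diagonal line x1 + x2 = 0
HYPERPLANE_B = ((1.0, 0.0), 0.0)  # vertical line x1 = 0

MARGIN_A = margin(*HYPERPLANE_A, SAMPLES)
MARGIN_B = margin(*HYPERPLANE_B, SAMPLES)
```

Hyperplane A has margin 2√2 ≈ 2.83 while hyperplane B has margin 1, so a rule that prefers the largest margin, as Hard-SVM does, would choose A over B among these two candidates.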


Q.7. What is Bayesian learning ? Also describe the features of Bayesian learning methods.

Ans. Bayesian learning is based on Bayes theorem and includes methods that utilize probabilities. Existing knowledge can be incorporated in the form of initial probabilities.

Features of Bayesian learning methods are as follows:

(i) Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.

(ii) Bayesian methods can accommodate hypotheses that make probabilistic predictions.

(iii) New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

(iv) Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting a prior probability for each candidate hypothesis and a probability
distribution over observed data for each possible hypothesis.

Q.8. What are Bayesian classifiers ?

Ans. Bayesian classifiers are statistical classifiers. They can predict class
membership probabilities, such as the probability that a given tuple belongs to
a particular class.

Bayesian classification is based on Bayes' theorem. Studies comparing
classification algorithms have found a simple Bayesian classifier known as the
naive Bayesian classifier to be comparable in performance with decision tree
and selected neural network classifiers. Bayesian classifiers have also exhibited
high accuracy and speed when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on
a given class is independent of the values of the other attributes. This assumption
is called class conditional independence. It is made to simplify the computations
involved and, in this sense, is considered 'naive'. Bayesian belief networks are
graphical models which, unlike naive Bayesian classifiers, allow the
representation of dependencies among subsets of attributes. Bayesian belief
networks can also be used for classification.

Q.9. Explain naive Bayesian classification in detail.
Or
Explain naive Bayes classifier with example.
[RGP May 2019 (VIS

Ans. The naive Bayesian classifier, or simple Bayesian classifier, works as
follows -

(i) Let D be a training set of tuples and their associated class labels.
As usual, each tuple is represented by an n-dimensional attribute vector,
X = (x_1, x_2, ..., x_n), depicting n measurements made on the tuple from n
attributes, respectively, A_1, A_2, ..., A_n.

(ii) Suppose that there are m classes, C_1, C_2, ..., C_m. Given a tuple X,
the classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the naive Bayesian classifier
predicts that tuple X belongs to the class C_i if and only if

\[ P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i. \]

Thus we maximize P(C_i|X). The class C_i for which P(C_i|X) is maximized is
called the maximum posteriori hypothesis. By Bayes' theorem,

\[ P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}. \]

(iii) As P(X) is constant for all classes, only P(X|C_i) P(C_i) need be
maximized. If the class prior probabilities are not known, then it is commonly
assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... = P(C_m),
and we would therefore maximize P(X|C_i). Otherwise, we maximize P(X|C_i) P(C_i).

(iv) In order to reduce computation in evaluating P(X|C_i), the naive assumption
of class conditional independence is made. This presumes that the values of the
attributes are conditionally independent of one another, given the class label of
the tuple. Thus

\[ P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i). \]

(v) In order to predict the class label of X, P(X|C_i) P(C_i) is evaluated
for each class C_i. The classifier predicts that the class label of tuple X is the
class C_i if and only if

\[ P(X|C_i)\,P(C_i) > P(X|C_j)\,P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i. \]

In other words, the predicted class label is the class C_i for which
P(X|C_i) P(C_i) is the maximum.

Q.10. What are Bayesian belief networks ? How does a Bayesian belief
network learn ?

Ans. Bayesian belief networks specify joint conditional probability
distributions. They allow class conditional independencies to be defined between
subsets of variables. They provide a graphical model of causal relationships, on
which learning can be performed. Trained Bayesian belief networks can be used
for classification. Bayesian belief networks are also known as belief networks,
Bayesian networks and probabilistic networks.

A belief network is defined by two components - a directed acyclic graph
and a set of conditional probability tables. Each node in the directed acyclic
graph represents a random variable. The variables may be discrete or
continuous-valued. They may correspond to actual attributes given in the data or
to "hidden variables" believed to form a relationship. Each arc represents a
probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is
a parent
or immediate predecessor of Z, and Z is a descendant of Y. Each variable is
conditionally independent of its nondescendants in the graph, given its parents.

A belief network has one conditional probability table (CPT) for each
variable. The CPT for variable Y specifies the conditional distribution
P(Y|Parents(Y)), where Parents(Y) are the parents of Y.

Let X = (x_1, ..., x_n) be a data tuple described by the variables or
attributes Y_1, ..., Y_n, respectively. Each variable is conditionally independent
of its nondescendants in the network graph, given its parents. This allows the
network to provide a complete representation of the existing joint probability
distribution with the following equation -

\[ P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i | Parents(Y_i)) \]

where P(x_1, ..., x_n) is the probability of a particular combination of values
of X, and the values for P(x_i|Parents(Y_i)) correspond to the entries in the CPT
for Y_i.

A node within the network can be selected as an "output" node,
representing a class label attribute. There may be more than one output node.
Various algorithms for learning can be applied to the network.

In the learning or training of a belief network, a number of scenarios are
possible. The network topology may be given in advance or inferred from the
data. The network variables may be observable or hidden in all or some of the
training tuples. The case of hidden data is also referred to as missing values or
incomplete data.

Several algorithms exist for learning the network topology from the training
data given observable variables. If the network topology is known and the
variables are observable, then training the network is straightforward. It consists
of computing the CPT entries, as is similarly done when computing the
probabilities involved in naive Bayesian classification. When the network
topology is given and some of the variables are hidden, there are various
methods to choose from for training the belief network.

Let D be a training set of data tuples, X_1, X_2, ..., X_{|D|}. Training the
belief network means that we must learn the values of the CPT entries. Let
w_{ijk} be a CPT entry for the variable Y_i = y_{ij} having the parents U_i = u_{ik},
where

\[ w_{ijk} \equiv P(Y_i = y_{ij} | U_i = u_{ik}). \]

The w_{ijk} are viewed as weights. The set of weights is collectively referred
to as W. The weights are initialized to random probability values. A gradient
descent strategy performs greedy hill-climbing. At each iteration, the weights are
updated and will eventually converge to a local optimum solution.

A gradient descent strategy is used to search for the w_{ijk} values that best
model the data, based on the assumption that each possible setting of w_{ijk} is
equally likely. Such a strategy is iterative. It searches for a solution along the
negative of the gradient of a criterion function. We want to find the set of
weights, W, that maximize this function. To start with, the weights are initialized
to random probability values. The gradient descent method performs greedy
hill-climbing in that, at each iteration or step along the way, the algorithm
moves toward what appears to be the best solution at the moment, without
backtracking. The weights are updated at each iteration. Eventually, they
converge to a local optimum solution.

Q.11. What is Bayes' theorem ? Describe basic probability notation.
How are these probabilities estimated ?

Ans. Let X be a data tuple. In Bayesian terms, X is considered "evidence".
As usual, it is described by measurements made on a set of n attributes. Let H
be some hypothesis, such as that the data tuple X belongs to a specified class
C. For classification problems, we want to determine P(H|X), the probability
that the hypothesis H holds given the "evidence" or observed data tuple X. In
other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H
conditioned on X. For example, suppose our world of data tuples is confined
to customers described by the attributes age and income, respectively, and
that X is a 35-year-old customer with an income of $40,000. Suppose that H
is the hypothesis that our customer will buy a computer. Then P(H|X) reflects
the probability that customer X will buy a computer given that we know the
customer's age and income.

In contrast, P(H) is the prior probability of H. For our example, this is
the probability that any given customer will buy a computer, regardless of age,
income, or any other information, for that matter. The posterior probability,
P(H|X), is based on more information than the prior probability, P(H), which
is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That
is, it is the probability that a customer, X, is 35 years old and earns $40,000,
given that we know the customer will buy a computer.

P(X) is the prior probability of X. Using our example, it is the probability
that a person from our set of customers is 35 years old and earns $40,000.

P(H), P(X|H) and P(X) may be estimated from the given data. Bayes' theorem
is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H) and P(X). Bayes' theorem is

\[ P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}. \]
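
As a rough illustration of Bayes' theorem, the following Python sketch estimates P(H), P(X|H) and P(X) by counting over a small made-up customer table and then combines them into the posterior P(H|X). The records, the attribute values and the function name `posterior` are assumptions introduced for this example only; they are not data from the text.

```python
from fractions import Fraction

# Hypothetical toy records: (age_group, income_level, buys_computer).
# The values are invented purely for illustration.
CUSTOMERS = [
    ("31-40", "medium", True),
    ("31-40", "medium", True),
    ("31-40", "medium", False),
    ("<=30", "low", False),
    ("<=30", "medium", True),
    (">40", "high", True),
    (">40", "low", False),
    ("31-40", "high", True),
]


def posterior(records, evidence):
    """Return P(H|X) via Bayes' theorem, estimating each term by counting.

    H is "the customer buys a computer"; X is an (age_group, income_level) pair.
    Assumes the evidence occurs at least once and there is at least one buyer.
    """
    total = len(records)
    buyers = [r for r in records if r[2]]
    p_h = Fraction(len(buyers), total)  # prior P(H)
    p_x = Fraction(sum(r[:2] == evidence for r in records), total)  # P(X)
    p_x_given_h = Fraction(
        sum(r[:2] == evidence for r in buyers), len(buyers)
    )  # P(X|H)
    return p_x_given_h * p_h / p_x  # Bayes' theorem


if __name__ == "__main__":
    print(posterior(CUSTOMERS, ("31-40", "medium")))
```

On this toy table the result equals the fraction of matching customers who bought a computer, which is what the theorem should give when every term is estimated from the same data.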
Q.12. Explain the applications of machine learning.

Ans. (i) Computer Vision - Many current vision systems, from face
recognition systems to systems that automatically classify microscope images
of cells, are developed using machine learning, because the resulting
systems are more accurate than hand-crafted programs. One massive-scale
application of computer vision trained using machine learning is its use by the
US Post Office to automatically sort letters containing handwritten addresses.
Over 85% of handwritten mail in the US is sorted automatically, using handwriting
analysis software trained to very high accuracy using machine learning over a
very large data set.

(ii) Speech Recognition - Currently available commercial systems
for speech recognition all use machine learning in one fashion or another to
train the system to recognize speech. The reason is simple - the speech
recognition accuracy is greater if one trains the system than if one attempts
to program it by hand. In fact, many commercial speech recognition systems
involve two distinct learning phases - one before the software is shipped
(training the general system in a speaker-independent fashion), and a second
phase after the user purchases the software (to achieve greater accuracy by
training in a speaker-dependent fashion).

(iii) Bio-surveillance - A variety of government efforts to detect and
track disease outbreaks now use machine learning. For example, the RODS
project involves real-time collection of admissions reports to emergency rooms
across western Pennsylvania, and the use of machine learning software to learn
the profile of typical admissions so that it can detect anomalous patterns of
symptoms and their geographical distribution. Current work involves adding in
a rich set of additional data, such as retail purchases of over-the-counter
medicines, to increase the information flow into the system, further increasing
the need for automated learning methods given this even more complex data set.

(iv) Robot Control - Machine learning methods have been
successfully used in a number of robot systems. For example, several
researchers have demonstrated the use of machine learning to acquire control
strategies for stable helicopter flight and helicopter aerobatics. A
DARPA-sponsored competition involving a robot driving autonomously for over
100 miles in the desert was won by a robot that used machine learning to
refine its ability to detect distant objects (training itself from self-collected
data consisting of terrain seen initially in the distance, and seen later up close).

(v) Natural Language Processing - It is a field which involves
both computer understanding and manipulation of human language. It is mostly
used on large text or other document sets, trying to discover new patterns.
NLP is a way to analyze, understand and find the meaning of human
language in a smart and useful way. By using NLP, developers can perform tasks
such as automatic summarization, translation, entity recognition and speech
recognition.

Discrete event simulation is a technique where patients are modeled as an
entity, associating each with some attribute information like age,
problematic scenarios, etc. Natural language processing is a technique
used to read the physician's notes and convert them to digital data
for system use. Predictive modeling gives predictions, such as admissions,
that are expected by hospitals to spread expertise which is in short supply.
Disease prediction and diagnosis is achieved by helping radiologists to make
intellectual decisions with radiology data (for example - CT, MRI and
radiographs).

Q.13. Describe case study of ImageNet competition.

Ans. The ImageNet large scale visual recognition challenge (ILSVRC) is also
known as ImageNet. ImageNet is a database of images used for visual
recognition competitions. The ImageNet competition is an annual competition
where researchers and their teams evaluate developed algorithms on specific
datasets, to review improvements in achieved accuracy in visual recognition
challenges.

ImageNet is a dataset of over 15 million labeled high-resolution images
belonging to roughly 22000 categories. The images were collected from the
web and labeled by human labelers using Amazon's Mechanical Turk
crowd-sourcing tool. Starting in 2010, as part of the Pascal visual object
challenge, an annual competition called the ImageNet large scale visual
recognition challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet
with roughly 1000 images in each of 1000 categories. In all, there are roughly
1.2 million training images, 50000 validation images, and 150000 testing images.

ILSVRC follows in the footsteps of the PASCAL VOC challenge,
established in 2005, which set the precedent for standardized evaluation of
recognition algorithms in the form of yearly competitions. As in PASCAL
VOC, ILSVRC consists of two components -

(i) A publicly available dataset, and

(ii) An annual competition and corresponding workshop.

The dataset allows for the development and comparison of categorical object
recognition algorithms, and the competition and workshop provide a way to track
the progress and discuss the lessons learned from the most successful and
innovative entries each year.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels
are available, so this is the version on which we performed most of our
experiments. On ImageNet, it is customary to report two error rates - top-1 and
top-5, where the top-5 error rate is the fraction of test images for which the
correct label is not among the five labels considered most probable by the model.

ImageNet consists of variable-resolution images, while our system requires
a constant input dimensionality. Therefore, we down-sampled the images to a
fixed resolution of 256 x 256. Given a rectangular image, we first rescaled the
image such that the shorter side was of length 256, and then cropped out the
central 256 x 256 patch from the resulting image. We did not pre-process the
images in any other way, except for subtracting the mean activity over the
training set from each pixel. So we trained our network on the (centred) raw
RGB values of the pixels.

Some examples of the winning architectures in the ImageNet competition are
AlexNet, ZFNet, NIN, GoogLeNet, ResNet and SENet. SENet is the latest winner,
of the ILSVRC 2017 competition. Other notable architectures are SegNet,
DenseNet and FractalNet.
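
The top-1 and top-5 error rates described above can be sketched in a few lines of Python. The function name `top_k_error`, the score lists and the labels below are hypothetical; a real evaluation would use a model's class scores over the full test set.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


if __name__ == "__main__":
    # Three made-up test images scored over six classes.
    demo_scores = [
        [0.05, 0.10, 0.50, 0.20, 0.10, 0.05],
        [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
        [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
    ]
    demo_labels = [2, 5, 4]
    print("top-1 error:", top_k_error(demo_scores, demo_labels, k=1))
    print("top-5 error:", top_k_error(demo_scores, demo_labels, k=5))
```

With k = 1 the function reduces to the ordinary misclassification rate, so the top-5 error can never exceed the top-1 error on the same predictions.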


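
The preprocessing described above (rescale so that the shorter side is 256, take the central 256 x 256 patch, subtract the training-set mean from each pixel) can be sketched without any imaging library by computing only the geometry and the mean subtraction. The helper names and the flat-list image representation are assumptions made for illustration; an actual pipeline would resample real pixel data.

```python
def rescaled_size(width, height, side=256):
    """Size after scaling so that the shorter side equals `side` (aspect ratio kept)."""
    scale = side / min(width, height)
    # max() guards against rounding the shorter side below `side`.
    return max(side, round(width * scale)), max(side, round(height * scale))


def central_crop_box(width, height, side=256):
    """(left, top, right, bottom) of the central side x side patch."""
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def subtract_mean(images):
    """Subtract the per-position mean over the training set from every image.

    `images` is a list of equally long flat lists of pixel values.
    """
    count = len(images)
    mean = [sum(column) / count for column in zip(*images)]
    return [[value - m for value, m in zip(image, mean)] for image in images]


if __name__ == "__main__":
    new_w, new_h = rescaled_size(640, 480)
    print(new_w, new_h)
    print(central_crop_box(new_w, new_h))
    print(subtract_mean([[1.0, 2.0], [3.0, 6.0]]))
```

After mean subtraction the values at each pixel position average to zero over the training set, which is the "centred" input the text refers to.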


